pith. machine review for the scientific record.

arxiv: 2307.05973 · v2 · submitted 2023-07-12 · 💻 cs.RO · cs.AI · cs.CL · cs.CV · cs.LG

Recognition: 2 theorem links · Lean Theorem

VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models

Chen Wang, Jiajun Wu, Li Fei-Fei, Ruohan Zhang, Wenlong Huang, Yunzhu Li

Pith reviewed 2026-05-13 08:52 UTC · model grok-4.3

classification 💻 cs.RO · cs.AI · cs.CL · cs.CV · cs.LG
keywords robotic manipulation · language models · 3D value maps · zero-shot learning · vision-language models · trajectory planning · closed-loop control · affordance inference

The pith

Large language models write code to compose 3D value maps that let robots execute free-form manipulation tasks without training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that LLMs can extract actionable knowledge from language instructions by inferring affordances and constraints, then use code-writing to interact with a vision-language model and build 3D value maps. These maps ground the knowledge directly into the robot's sensor observations, so a model-based planner can produce dense sequences of 6-DoF end-effector waypoints. The resulting closed-loop trajectories handle a wide range of everyday tasks on open sets of objects and instructions while remaining robust to dynamic changes in the scene. Readers would care because the method removes the need for hand-designed motion primitives or task-specific training data, replacing them with a general interface between language reasoning and physical control.
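
To make the mechanism concrete, here is a minimal sketch of the kind of value-map composition the pipeline relies on. Everything in it is invented for illustration: the hard-coded `detect` function stands in for a VLM query over the robot's observation, and the grid size, Gaussian widths, and weights are arbitrary; the paper's actual LLM-generated code and planner are more elaborate.

```python
import numpy as np

# Hypothetical stand-in for a VLM query: returns the voxel-grid index of a
# named object. In the real system this comes from open-vocabulary detection
# over the robot's RGB-D observation; here it is hard-coded for the example.
def detect(name: str) -> np.ndarray:
    positions = {"bowl": np.array([20, 30, 10]), "vase": np.array([25, 30, 12])}
    return positions[name]

GRID = (40, 60, 30)  # assumed workspace voxelization, not the paper's
coords = np.stack(np.meshgrid(*[np.arange(n) for n in GRID], indexing="ij"), axis=-1)

def gaussian_map(center, sigma):
    """Value peaked at `center` and decaying with distance (an affordance)."""
    d2 = np.sum((coords - center) ** 2, axis=-1)
    return np.exp(-d2 / (2 * sigma ** 2))

# The kind of composition an LLM might write for an instruction like
# "put the sponge in the bowl, and stay away from the vase":
target_map = gaussian_map(detect("bowl"), sigma=3.0)   # attract toward the goal
avoid_map = gaussian_map(detect("vase"), sigma=5.0)    # repel from the obstacle
value_map = target_map - 2.0 * avoid_map               # weighted composition

# A planner would optimize a trajectory over this map; here we only read off
# the highest-value voxel as the next waypoint target.
waypoint = np.unravel_index(np.argmax(value_map), GRID)
print("next waypoint voxel:", waypoint)
```

The point is only that the composed value map is an ordinary array a planner can optimize over, which is what lets language-level affordances and constraints combine by simple addition and weighting.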

Core claim

By observing that LLMs excel at inferring affordances and constraints from free-form instructions, the work shows they can further leverage code generation to compose 3D value maps through interaction with a VLM; the resulting maps ground linguistic knowledge into the agent's observation space and enable zero-shot synthesis of closed-loop robot trajectories that remain robust to perturbations, with an optional online learning step for contact-rich dynamics.

What carries the argument

Composable 3D value maps produced by LLM-generated code that queries a VLM, encoding task-specific affordances and constraints for model-based planning.

If this is right

  • Zero-shot closed-loop trajectories become possible for a large variety of manipulation tasks without pre-defined primitives.
  • Online experience can be used to learn dynamics models for contact-rich interactions without full retraining.
  • The same framework supports both simulated and real-robot environments across open sets of objects and instructions.
  • Robustness to dynamic perturbations arises directly from the closed-loop use of the composed value maps.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If value-map composition proves stable, the same code-generation interface could be applied to other embodied domains such as mobile navigation or multi-arm coordination.
  • Improving the underlying VLM's spatial grounding would directly increase the precision of the resulting trajectories in cluttered real-world scenes.
  • The approach implies that future systems might replace large collections of task-specific controllers with a single general value-map composer driven by language.

Load-bearing premise

LLMs can reliably infer correct affordances and constraints from arbitrary language instructions and translate them into accurate 3D value maps through code that works for any open-set objects and scenes.

What would settle it

Running the system on a novel instruction such as 'stack the blue cylinder on the yellow cube' with previously unseen objects and measuring whether the generated trajectory collides or fails to make contact despite the planner receiving the value map.
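
A hedged sketch of what that settling test could look like, using synthetic stand-ins for the system's outputs: the trajectory, point clouds, and tolerances below are hypothetical, and a real run would substitute the planner's actual waypoints and measured object geometry.

```python
import numpy as np

def evaluate(trajectory, obstacle_points, target_points,
             collision_tol=0.01, contact_tol=0.02):
    """trajectory: (T, 3) end-effector positions in metres.
    Returns (collided, made_contact)."""
    # Minimum clearance between any waypoint and any obstacle point.
    d_obs = np.linalg.norm(trajectory[:, None, :] - obstacle_points[None, :, :], axis=-1)
    collided = d_obs.min() < collision_tol
    # Distance from the final waypoint to the nearest target-surface point.
    d_goal = np.linalg.norm(trajectory[-1] - target_points, axis=-1).min()
    made_contact = d_goal < contact_tol
    return collided, made_contact

# Synthetic data standing in for 'stack the blue cylinder on the yellow cube'
# with previously unseen objects.
traj = np.linspace([0.3, 0.0, 0.4], [0.5, 0.1, 0.12], num=50)
cube_top = np.random.default_rng(0).normal([0.5, 0.1, 0.10], 0.005, size=(200, 3))
clutter = np.random.default_rng(1).normal([0.4, 0.05, 0.2], 0.01, size=(200, 3))
print(evaluate(traj, clutter, cube_top))
```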

read the original abstract

Large language models (LLMs) are shown to possess a wealth of actionable knowledge that can be extracted for robot manipulation in the form of reasoning and planning. Despite the progress, most still rely on pre-defined motion primitives to carry out the physical interactions with the environment, which remains a major bottleneck. In this work, we aim to synthesize robot trajectories, i.e., a dense sequence of 6-DoF end-effector waypoints, for a large variety of manipulation tasks given an open-set of instructions and an open-set of objects. We achieve this by first observing that LLMs excel at inferring affordances and constraints given a free-form language instruction. More importantly, by leveraging their code-writing capabilities, they can interact with a vision-language model (VLM) to compose 3D value maps to ground the knowledge into the observation space of the agent. The composed value maps are then used in a model-based planning framework to zero-shot synthesize closed-loop robot trajectories with robustness to dynamic perturbations. We further demonstrate how the proposed framework can benefit from online experiences by efficiently learning a dynamics model for scenes that involve contact-rich interactions. We present a large-scale study of the proposed method in both simulated and real-robot environments, showcasing the ability to perform a large variety of everyday manipulation tasks specified in free-form natural language. Videos and code at https://voxposer.github.io

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces VoxPoser, a framework that leverages LLMs' code-writing abilities to interact with a VLM and compose 3D value maps from free-form natural language instructions and open-set objects. These value maps ground affordances and constraints into the robot's observation space and are used within a model-based planning loop to synthesize closed-loop 6-DoF end-effector trajectories without pre-defined motion primitives. The approach is extended with online dynamics learning for contact-rich scenes and is validated through large-scale experiments in both simulation and real-robot settings, claiming robustness to dynamic perturbations across a variety of everyday manipulation tasks.

Significance. If the central claims hold, the work would represent a meaningful step toward language-conditioned robotic manipulation that avoids hand-crafted primitives and achieves zero-shot generalization via grounded value maps. The combination of LLM reasoning with VLM-based 3D composition and model-based planning offers a concrete mechanism for translating high-level instructions into executable trajectories, with potential implications for scalable, open-vocabulary robot control.

major comments (3)
  1. [Experiments] The central claim that LLM-generated code reliably composes accurate 3D value maps via VLM interaction for arbitrary open-set instructions is load-bearing, yet the evaluation provides only aggregate task success rates without a quantitative breakdown of composition success rate, code execution failures, or VLM grounding errors (see Experiments section and associated tables).
  2. [§4.2] Robustness to dynamic perturbations is asserted for the closed-loop planner, but no analysis quantifies how errors in the composed value maps (arising from VLM spatial inaccuracies or LLM-inferred constraints) propagate through the planning pipeline (see §4.2 on model-based planning and the perturbation experiments).
  3. [Dynamics learning subsection] The online dynamics model learning for contact-rich interactions is presented as an efficient extension, but the manuscript does not report sample complexity, convergence behavior, or ablation results isolating its contribution to overall performance (see the dynamics learning subsection).
minor comments (2)
  1. [Method] Notation for the 3D value map composition step could be clarified with an explicit equation or pseudocode block showing how LLM output interfaces with VLM queries.
  2. [Figures] Figure captions for the real-robot results should include the exact number of trials and success criteria to allow direct comparison with simulation results.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major point below and commit to revisions that strengthen the empirical support for our claims without altering the core contributions.

read point-by-point responses
  1. Referee: [Experiments] The central claim that LLM-generated code reliably composes accurate 3D value maps via VLM interaction for arbitrary open-set instructions is load-bearing, yet the evaluation provides only aggregate task success rates without a quantitative breakdown of composition success rate, code execution failures, or VLM grounding errors (see Experiments section and associated tables).

    Authors: We agree that a granular breakdown of the composition pipeline would better substantiate the central claim. While aggregate task success rates already reflect end-to-end reliability across open-set instructions and objects, we will add a dedicated analysis subsection (and accompanying table) that reports: (i) per-task composition success rate (value map validity judged by VLM consistency checks), (ii) code execution failure rates, and (iii) VLM grounding error statistics (spatial offset distributions). These metrics will be computed from the same experimental logs used for the original tables. revision: yes

  2. Referee: [§4.2] Robustness to dynamic perturbations is asserted for the closed-loop planner, but no analysis quantifies how errors in the composed value maps (arising from VLM spatial inaccuracies or LLM-inferred constraints) propagate through the planning pipeline (see §4.2 on model-based planning and the perturbation experiments).

    Authors: The closed-loop replanning loop is intended to absorb moderate value-map inaccuracies, yet we acknowledge the absence of explicit propagation analysis. In the revision we will insert a new paragraph and figure in §4.2 that quantifies sensitivity: we will inject controlled spatial noise into the composed value maps (matching observed VLM error distributions) and measure resulting changes in planning success rate and trajectory smoothness across the perturbation experiments. This will directly illustrate the planner's tolerance to the error sources mentioned (a toy version of this noise-injection protocol is sketched after these responses). revision: yes

  3. Referee: [Dynamics learning subsection] The online dynamics model learning for contact-rich interactions is presented as an efficient extension, but the manuscript does not report sample complexity, convergence behavior, or ablation results isolating its contribution to overall performance (see the dynamics learning subsection).

    Authors: We will expand the dynamics-learning subsection with the requested details: learning curves showing sample complexity and convergence (mean-squared prediction error vs. number of interaction steps), plus an ablation table comparing task success rates with and without the learned dynamics model on the contact-rich subset of tasks. These additions will isolate the contribution of the online learning component. revision: yes
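
As a companion to response 2, the promised sensitivity protocol can be illustrated with a toy sketch. Nothing here comes from the paper: the single-peak value map, the greedy hill-climbing stand-in for the planner, and the Gaussian voxel offsets standing in for VLM spatial error are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def greedy_plan(value_map, start, steps=50):
    """Toy stand-in for the planner: hill-climb on the value map."""
    pos = np.array(start)
    for _ in range(steps):
        best, best_val = pos, value_map[tuple(pos)]
        for delta in np.eye(3, dtype=int).tolist() + (-np.eye(3, dtype=int)).tolist():
            cand = np.clip(pos + delta, 0, np.array(value_map.shape) - 1)
            if value_map[tuple(cand)] > best_val:
                best, best_val = cand, value_map[tuple(cand)]
        pos = best
    return pos

def shifted(value_map, sigma_vox):
    """Inject spatial error by translating the map by a random integer offset."""
    offset = tuple(np.round(rng.normal(0, sigma_vox, size=3)).astype(int))
    return np.roll(value_map, offset, axis=(0, 1, 2))

# Hypothetical clean value map with a single goal peak.
shape = (30, 30, 30)
goal = np.array([20, 15, 10])
idx = np.indices(shape).transpose(1, 2, 3, 0)            # per-voxel coordinates
clean_map = np.exp(-np.sum((idx - goal) ** 2, axis=-1) / 18.0)

for sigma in [0.0, 1.0, 2.0, 4.0]:        # assumed VLM error scales, in voxels
    successes = 0
    for _ in range(20):
        end = greedy_plan(shifted(clean_map, sigma), start=(2, 2, 2))
        successes += np.linalg.norm(end - goal) <= 3     # success = near true goal
    print(f"sigma={sigma}: success rate {successes / 20:.2f}")
```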

Circularity Check

0 steps flagged

No circularity: framework composes external LLM/VLM outputs into value maps without self-referential reduction

full rationale

The paper presents a system that invokes pre-trained LLMs for affordance inference and code generation, then runs that code against an external VLM to produce 3D value maps for model-based planning. No equations, fitted parameters, or derivations are shown that reduce the output maps or trajectories back to the same inputs by construction. The central mechanism is a new composition step that depends on the independent capabilities of the cited LLMs and VLMs rather than on any self-citation chain or ansatz smuggled from prior author work. The method is therefore validated against external benchmarks and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The central claim rests on the assumption that LLMs possess extractable actionable knowledge for affordances and that code-writing plus VLM interaction can reliably produce usable 3D value maps; no free parameters are described, and the only invented construct is the composable 3D value map itself.

axioms (2)
  • domain assumption LLMs excel at inferring affordances and constraints given free-form language instructions
    Stated directly in the abstract as the starting observation enabling the method.
  • domain assumption LLMs can interact with VLMs via code to compose accurate 3D value maps
    Core mechanism claimed in the abstract without further justification provided.
invented entities (1)
  • Composable 3D value maps · no independent evidence
    purpose: To ground LLM-inferred knowledge into the agent's 3D observation space for planning
    New construct introduced by the paper to bridge language and control

pith-pipeline@v0.9.0 · 5574 in / 1416 out tokens · 38495 ms · 2026-05-13T08:52:38.975647+00:00 · methodology

discussion (0)


Forward citations

Cited by 26 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. CoRAL: Contact-Rich Adaptive LLM-based Control for Robotic Manipulation

    cs.RO 2026-05 unverdicted novelty 7.0

    CoRAL lets LLMs act as adaptive cost designers for motion planners while using VLM priors and online identification to handle unknown physics, achieving over 50% higher success rates than baselines in unseen contact-r...

  2. PhysCodeBench: Benchmarking Physics-Aware Symbolic Simulation of 3D Scenes via Self-Corrective Multi-Agent Refinement

    cs.RO 2026-04 unverdicted novelty 7.0

    PhysCodeBench benchmark and SMRF multi-agent framework enable better AI generation of physically accurate 3D simulation code, boosting performance by 31 points over baselines.

  3. Watching Movies Like a Human: Egocentric Emotion Understanding for Embodied Companions

    cs.CV 2026-04 conditional novelty 7.0

    Creates the first egocentric screen-view movie emotion benchmark and demonstrates that cinematic models drop sharply in Macro-F1 on realistic robot-like viewing conditions while domain-specific training improves robustness.

  4. How Far Are Large Multimodal Models from Human-Level Spatial Action? A Benchmark for Goal-Oriented Embodied Navigation in Urban Airspace

    cs.AI 2026-04 unverdicted novelty 7.0

    Large multimodal models display emerging but limited spatial action capabilities in goal-oriented urban 3D navigation, remaining far from human-level performance with errors diverging rapidly after critical decision points.

  5. Harnessing Embodied Agents: Runtime Governance for Policy-Constrained Execution

    cs.RO 2026-04 unverdicted novelty 7.0

    A runtime governance framework for embodied agents achieves 96.2% interception of unauthorized actions and 91.4% recovery success in 1000 simulation trials by externalizing policy enforcement.

  6. Referring-Aware Visuomotor Policy Learning for Closed-Loop Manipulation

    cs.RO 2026-04 unverdicted novelty 7.0

    ReV is a referring-aware visuomotor policy using coupled diffusion heads for real-time trajectory replanning in robotic manipulation, trained solely via targeted perturbations to expert demonstrations and achieving hi...

  7. Open X-Embodiment: Robotic Learning Datasets and RT-X Models

    cs.RO 2023-10 unverdicted novelty 7.0

    A collaborative dataset spanning 22 robots and 527 skills enables RT-X models that transfer capabilities across different robot embodiments.

  8. RoboEvolve: Co-Evolving Planner-Simulator for Robotic Manipulation with Limited Data

    cs.RO 2026-05 unverdicted novelty 6.0

    A co-evolutionary VLM-VGM loop on 500 unlabeled images raises planner success by 30 points and simulator success by 48 percent while beating fully supervised baselines.

  9. From Reaction to Anticipation: Proactive Failure Recovery through Agentic Task Graph for Robotic Manipulation

    cs.RO 2026-05 unverdicted novelty 6.0

    AgentChord models manipulation tasks as directed graphs enriched with anticipatory recovery branches, using specialized agents to enable immediate, low-latency failure responses and improve success on long-horizon bim...

  10. Decompose and Recompose: Reasoning New Skills from Existing Abilities for Cross-Task Robotic Manipulation

    cs.RO 2026-05 unverdicted novelty 6.0

    Decompose and Recompose decomposes seen robotic demonstrations into skill-action alignments and recomposes them via visual-semantic retrieval and planning to enable zero-shot cross-task generalization.

  11. Odysseus: Scaling VLMs to 100+ Turn Decision-Making in Games via Reinforcement Learning

    cs.LG 2026-05 unverdicted novelty 6.0

    Odysseus adapts PPO with a turn-level critic and leverages pretrained VLM action priors to train agents achieving at least 3x average game progress over frontier models in long-horizon Super Mario Land.

  12. An LLM-Driven Closed-Loop Autonomous Learning Framework for Robots Facing Uncovered Tasks in Open Environments

    cs.RO 2026-04 unverdicted novelty 6.0

    Robots autonomously convert LLM-guided experiences into a reusable local method library, reducing average execution time from 7.7772s to 6.7779s and LLM calls per task from 1.0 to 0.2 in repeated-task experiments.

  13. Intent-aligned Autonomous Spacecraft Guidance via Reasoning Models

    eess.SY 2026-04 unverdicted novelty 6.0

    A pipeline links foundation-model intent reasoning to safe trajectory optimization via behavior sequences and waypoint constraints, achieving over 90% convergence and 1.5x better intent satisfaction in close-proximity tests.

  14. LAMP: Lift Image-Editing as General 3D Priors for Open-world Manipulation

    cs.CV 2026-04 unverdicted novelty 6.0

    LAMP extracts continuous 3D inter-object transformations from image editing to serve as geometry-aware priors for zero-shot open-world robotic manipulation.

  15. World Action Models are Zero-shot Policies

    cs.RO 2026-02 unverdicted novelty 6.0

    DreamZero uses a 14B video diffusion model as a World Action Model to achieve over 2x better zero-shot generalization on real robots than state-of-the-art VLAs, real-time 7Hz closed-loop control, and cross-embodiment ...

  16. Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success

    cs.RO 2025-02 accept novelty 6.0

    OpenVLA-OFT fine-tuning boosts LIBERO success rate from 76.5% to 97.1%, speeds action generation 26x, and outperforms baselines on real bimanual dexterous tasks.

  17. Octo: An Open-Source Generalist Robot Policy

    cs.RO 2024-05 unverdicted novelty 6.0

    Octo is an open-source transformer-based generalist robot policy pretrained on 800k trajectories that serves as an effective initialization for finetuning across diverse robotic platforms.

  18. DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models

    cs.CV 2024-02 unverdicted novelty 6.0

    DriveVLM adds vision-language models with scene description, analysis, and hierarchical planning modules to autonomous driving, paired with a hybrid DriveVLM-Dual system tested on nuScenes and SUP-AD datasets and depl...

  19. BioProVLA-Agent: An Affordable, Protocol-Driven, Vision-Enhanced VLA-Enabled Embodied Multi-Agent System with Closed-Loop-Capable Reasoning for Biological Laboratory Manipulation

    cs.RO 2026-05 unverdicted novelty 5.0

    BioProVLA-Agent integrates protocol parsing, visual state verification, and VLA-based execution in a closed-loop multi-agent framework with AugSmolVLA augmentation to improve robustness for biological lab tasks like t...

  20. CoRAL: Contact-Rich Adaptive LLM-based Control for Robotic Manipulation

    cs.RO 2026-05 unverdicted novelty 5.0

    CoRAL lets LLMs design objective functions for robot motion planners and uses vision-language models plus real-time identification to adapt to unknown physical properties, raising success rates by over 50 percent on n...

  21. Visibility-Aware Mobile Grasping in Dynamic Environments

    cs.RO 2026-05 unverdicted novelty 5.0

    A visibility-aware mobile grasping system with iterative whole-body planning and behavior-tree subgoal generation achieves 68.8% success in unknown static and 58% in dynamic environments, outperforming a baseline by 2...

  22. AnyUser: Translating Sketched User Intent into Domestic Robots

    cs.RO 2026-04 unverdicted novelty 5.0

    AnyUser translates free-form sketches on images plus optional language into executable robot actions for domestic tasks using multimodal fusion and a hierarchical policy.

  23. ROSClaw: A Hierarchical Semantic-Physical Framework for Heterogeneous Multi-Agent Collaboration

    cs.RO 2026-04 unverdicted novelty 5.0

    ROSClaw is a hierarchical framework that unifies vision-language model control with e-URDF-based sim-to-real mapping and closed-loop data collection to enable semantic-physical collaboration among heterogeneous multi-...

  24. Visibility-Aware Mobile Grasping in Dynamic Environments

    cs.RO 2026-05 unverdicted novelty 4.0

    A unified visibility-aware mobile grasping system using whole-body planning, active perception, and behavior trees improves success rates in unknown static and dynamic environments.

  25. Vision-Language-Action Safety: Threats, Challenges, Evaluations, and Mechanisms

    cs.RO 2026-04 accept novelty 4.0

    A literature survey that unifies fragmented work on attacks, defenses, evaluations, and deployment challenges for Vision-Language-Action models in robotics.

  26. XEmbodied: A Foundation Model with Enhanced Geometric and Physical Cues for Large-Scale Embodied Environments

    cs.CV 2026-04 unverdicted novelty 4.0

    XEmbodied is a foundation model that integrates 3D geometric and physical signals into VLMs using a 3D Adapter and Efficient Image-Embodied Adapter, plus progressive curriculum and RL post-training, to improve spatial...

Reference graph

Works this paper leans on

138 extracted references · 138 canonical work pages · cited by 24 Pith papers · 18 internal anchors
