VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models
Pith reviewed 2026-05-13 08:52 UTC · model grok-4.3
The pith
Large language models write code to compose 3D value maps that let robots execute free-form manipulation tasks without training.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Starting from the observation that LLMs excel at inferring affordances and constraints from free-form instructions, the work shows that their code-writing abilities let them interact with a VLM to compose 3D value maps; the resulting maps ground linguistic knowledge in the agent's observation space and enable zero-shot synthesis of closed-loop robot trajectories that remain robust to perturbations, with an optional online learning step for contact-rich dynamics.
What carries the argument
Composable 3D value maps produced by LLM-generated code that queries a VLM, encoding task-specific affordances and constraints for model-based planning.
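To make the mechanism concrete, below is a minimal sketch of the kind of composition code an LLM might emit, assuming a hypothetical detect() wrapper around the VLM, a fixed voxel grid, and Gaussian cost shapes; the names, resolution, and parameters are illustrative assumptions, not VoxPoser's actual API.

```python
import numpy as np

GRID = (100, 100, 100)  # assumed workspace voxelization; not the paper's actual resolution

def detect(object_phrase):
    """Hypothetical VLM query: returns the named entity's centroid in voxel coordinates."""
    raise NotImplementedError  # stands in for an open-vocabulary detector call

def gaussian_bump(center, sigma, sign):
    """Dense 3D map that peaks (sign=+1) or dips (sign=-1) at `center`."""
    zz, yy, xx = np.meshgrid(*(np.arange(s) for s in GRID), indexing="ij")
    d2 = (zz - center[0]) ** 2 + (yy - center[1]) ** 2 + (xx - center[2]) ** 2
    return sign * np.exp(-d2 / (2.0 * sigma ** 2))

def compose_value_map(targets, obstacles):
    """Sum attraction toward task-relevant entities and repulsion from things to avoid."""
    value = np.zeros(GRID)
    for phrase in targets:      # e.g. "top of the yellow cube"
        value += gaussian_bump(detect(phrase), sigma=5.0, sign=+1.0)
    for phrase in obstacles:    # e.g. "the glass of water"
        value += gaussian_bump(detect(phrase), sigma=8.0, sign=-1.0)
    return value  # consumed by a model-based planner that seeks high-value regions
```

The point of the design is that the LLM writes this per-instruction composition logic, while the VLM supplies the spatial grounding the code queries.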
If this is right
- Zero-shot closed-loop trajectories become possible for a large variety of manipulation tasks without pre-defined primitives.
- Online experience can be used to learn dynamics models for contact-rich interactions without full retraining.
- The same framework supports both simulated and real-robot environments across open sets of objects and instructions.
- Robustness to dynamic perturbations arises directly from the closed-loop use of the composed value maps.
Where Pith is reading between the lines
- If value-map composition proves stable, the same code-generation interface could be applied to other embodied domains such as mobile navigation or multi-arm coordination.
- Improving the underlying VLM's spatial grounding would directly increase the precision of the resulting trajectories in cluttered real-world scenes.
- The approach implies that future systems might replace large collections of task-specific controllers with a single general value-map composer driven by language.
Load-bearing premise
LLMs can reliably infer correct affordances and constraints from arbitrary language instructions and translate them into accurate 3D value maps through code that works for any open-set objects and scenes.
What would settle it
Running the system on a novel instruction such as 'stack the blue cylinder on the yellow cube' with previously unseen objects and measuring whether the generated trajectory collides or fails to make contact despite the planner receiving the value map.
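A minimal sketch of how such a trial could be scored, assuming the trajectory is a sequence of 3D end-effector waypoints and the objects are approximated by axis-aligned bounding boxes; both are simplifying assumptions for illustration, not the paper's evaluation code.

```python
import numpy as np

def point_to_aabb_distance(p, aabb):
    """Euclidean distance from point p to an axis-aligned box given as (min_xyz, max_xyz)."""
    lo, hi = aabb
    return float(np.linalg.norm(np.maximum(0.0, np.maximum(lo - p, p - hi))))

def evaluate_trajectory(waypoints, target_aabb, obstacle_aabbs, contact_tol=0.01):
    """Flag collisions with obstacles and whether the end effector ever reaches the target."""
    collided = any(point_to_aabb_distance(p, box) == 0.0
                   for p in waypoints for box in obstacle_aabbs)
    made_contact = any(point_to_aabb_distance(p, target_aabb) <= contact_tol
                       for p in waypoints)
    return {"collision": collided, "contact": made_contact}
```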
Original abstract
Large language models (LLMs) are shown to possess a wealth of actionable knowledge that can be extracted for robot manipulation in the form of reasoning and planning. Despite the progress, most still rely on pre-defined motion primitives to carry out the physical interactions with the environment, which remains a major bottleneck. In this work, we aim to synthesize robot trajectories, i.e., a dense sequence of 6-DoF end-effector waypoints, for a large variety of manipulation tasks given an open-set of instructions and an open-set of objects. We achieve this by first observing that LLMs excel at inferring affordances and constraints given a free-form language instruction. More importantly, by leveraging their code-writing capabilities, they can interact with a vision-language model (VLM) to compose 3D value maps to ground the knowledge into the observation space of the agent. The composed value maps are then used in a model-based planning framework to zero-shot synthesize closed-loop robot trajectories with robustness to dynamic perturbations. We further demonstrate how the proposed framework can benefit from online experiences by efficiently learning a dynamics model for scenes that involve contact-rich interactions. We present a large-scale study of the proposed method in both simulated and real-robot environments, showcasing the ability to perform a large variety of everyday manipulation tasks specified in free-form natural language. Videos and code at https://voxposer.github.io
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces VoxPoser, a framework that leverages LLMs' code-writing abilities to interact with a VLM and compose 3D value maps from free-form natural language instructions and open-set objects. These value maps ground affordances and constraints into the robot's observation space and are used within a model-based planning loop to synthesize closed-loop 6-DoF end-effector trajectories without pre-defined motion primitives. The approach is extended with online dynamics learning for contact-rich scenes and is validated through large-scale experiments in both simulation and real-robot settings, claiming robustness to dynamic perturbations across a variety of everyday manipulation tasks.
Significance. If the central claims hold, the work would represent a meaningful step toward language-conditioned robotic manipulation that avoids hand-crafted primitives and achieves zero-shot generalization via grounded value maps. The combination of LLM reasoning with VLM-based 3D composition and model-based planning offers a concrete mechanism for translating high-level instructions into executable trajectories, with potential implications for scalable, open-vocabulary robot control.
major comments (3)
- [Experiments] The central claim that LLM-generated code reliably composes accurate 3D value maps via VLM interaction for arbitrary open-set instructions is load-bearing, yet the evaluation provides only aggregate task success rates without a quantitative breakdown of composition success rate, code execution failures, or VLM grounding errors (see Experiments section and associated tables).
- [§4.2] Robustness to dynamic perturbations is asserted for the closed-loop planner, but no analysis quantifies how errors in the composed value maps (arising from VLM spatial inaccuracies or LLM-inferred constraints) propagate through the planning pipeline (see §4.2 on model-based planning and the perturbation experiments).
- [Dynamics learning subsection] The online dynamics model learning for contact-rich interactions is presented as an efficient extension, but the manuscript does not report sample complexity, convergence behavior, or ablation results isolating its contribution to overall performance (see the dynamics learning subsection).
minor comments (2)
- [Method] Notation for the 3D value map composition step could be clarified with an explicit equation or pseudocode block showing how LLM output interfaces with VLM queries.
- [Figures] Figure captions for the real-robot results should include the exact number of trials and success criteria to allow direct comparison with simulation results.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments. We address each major point below and commit to revisions that strengthen the empirical support for our claims without altering the core contributions.
Point-by-point responses
Referee: [Experiments] The central claim that LLM-generated code reliably composes accurate 3D value maps via VLM interaction for arbitrary open-set instructions is load-bearing, yet the evaluation provides only aggregate task success rates without a quantitative breakdown of composition success rate, code execution failures, or VLM grounding errors (see Experiments section and associated tables).
Authors: We agree that a granular breakdown of the composition pipeline would better substantiate the central claim. While aggregate task success rates already reflect end-to-end reliability across open-set instructions and objects, we will add a dedicated analysis subsection (and accompanying table) that reports: (i) per-task composition success rate (value map validity judged by VLM consistency checks), (ii) code execution failure rates, and (iii) VLM grounding error statistics (spatial offset distributions). These metrics will be computed from the same experimental logs used for the original tables. revision: yes
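For illustration, a minimal sketch of the promised breakdown computed from per-episode logs; the field names (code_ok, map_ok, task_ok, grounding_err_cm) are assumed here, not the authors' actual logging schema.

```python
from statistics import mean

def pipeline_breakdown(episodes):
    """Aggregate stage-wise statistics from a list of per-episode log dicts."""
    n = len(episodes)
    return {
        "code_execution_failure_rate": sum(not e["code_ok"] for e in episodes) / n,
        "composition_success_rate": sum(e["map_ok"] for e in episodes) / n,
        "task_success_rate": sum(e["task_ok"] for e in episodes) / n,
        "mean_grounding_error_cm": mean(e["grounding_err_cm"] for e in episodes),
    }
```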
Referee: [§4.2] Robustness to dynamic perturbations is asserted for the closed-loop planner, but no analysis quantifies how errors in the composed value maps (arising from VLM spatial inaccuracies or LLM-inferred constraints) propagate through the planning pipeline (see §4.2 on model-based planning and the perturbation experiments).
Authors: The closed-loop replanning loop is intended to absorb moderate value-map inaccuracies, yet we acknowledge the absence of explicit propagation analysis. In the revision we will insert a new paragraph and figure in §4.2 that quantifies sensitivity: we will inject controlled spatial noise into the composed value maps (matching observed VLM error distributions) and measure resulting changes in planning success rate and trajectory smoothness across the perturbation experiments. This will directly illustrate the planner’s tolerance to the error sources mentioned. revision: yes
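A minimal sketch of one way such a sensitivity sweep could be run; the isotropic Gaussian offset model and the plan_and_execute callback are assumptions standing in for whatever noise model and planner interface the authors adopt.

```python
import numpy as np

def perturb_value_map(value_map, sigma_vox, rng):
    """Shift the composed map by a random spatial offset to mimic VLM grounding error."""
    offset = np.round(rng.normal(0.0, sigma_vox, size=3)).astype(int)
    return np.roll(value_map, shift=tuple(offset), axis=(0, 1, 2))

def sensitivity_sweep(value_maps, plan_and_execute, sigmas=(0.0, 1.0, 2.0, 4.0)):
    """Closed-loop success rate as a function of injected spatial noise (in voxels)."""
    rng = np.random.default_rng(0)
    results = {}
    for sigma in sigmas:
        outcomes = [plan_and_execute(perturb_value_map(vm, sigma, rng) if sigma > 0 else vm)
                    for vm in value_maps]          # one composed map per evaluation episode
        results[sigma] = sum(outcomes) / len(outcomes)  # assumes True/1 means task success
    return results
```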
Referee: [Dynamics learning subsection] The online dynamics model learning for contact-rich interactions is presented as an efficient extension, but the manuscript does not report sample complexity, convergence behavior, or ablation results isolating its contribution to overall performance (see the dynamics learning subsection).
Authors: We will expand the dynamics-learning subsection with the requested details: learning curves showing sample complexity and convergence (mean-squared prediction error vs. number of interaction steps), plus an ablation table comparing task success rates with and without the learned dynamics model on the contact-rich subset of tasks. These additions will isolate the contribution of the online learning component. revision: yes
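A minimal sketch of how such a sample-complexity curve could be logged during online interaction; env, its methods, and the sklearn-style model are assumptions, not the paper's actual dynamics learner.

```python
import numpy as np

def online_dynamics_curve(env, model, n_steps=500, eval_every=25):
    """Fit s' = f(s, a) from streaming interaction and record held-out MSE over time."""
    states, actions, next_states, curve = [], [], [], []
    s = env.reset()
    for t in range(1, n_steps + 1):
        a = env.sample_action()       # assumed exploration policy
        s_next = env.step(a)          # assumed to return the next state
        states.append(s)
        actions.append(a)
        next_states.append(s_next)
        s = s_next
        if t % eval_every == 0:
            X = np.hstack([np.asarray(states), np.asarray(actions)])
            Y = np.asarray(next_states)
            split = int(0.8 * len(X))
            model.fit(X[:split], Y[:split])   # assumed sklearn-style regressor
            mse = float(np.mean((model.predict(X[split:]) - Y[split:]) ** 2))
            curve.append((t, mse))            # points for the convergence plot
    return curve
```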
Circularity Check
No circularity: framework composes external LLM/VLM outputs into value maps without self-referential reduction
full rationale
The paper presents a system that invokes pre-trained LLMs for affordance inference and code generation, then runs that code against an external VLM to produce 3D value maps for model-based planning. No equations, fitted parameters, or derivations are shown that reduce the output maps or trajectories back to the same inputs by construction. The central mechanism is a new composition step that depends on the independent capabilities of the cited LLMs and VLMs rather than on any self-citation chain or ansatz smuggled from prior author work. The method is therefore self-contained against external benchmarks and receives the default non-circularity finding.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: LLMs excel at inferring affordances and constraints given free-form language instructions
- domain assumption: LLMs can interact with VLMs via code to compose accurate 3D value maps
invented entities (1)
- Composable 3D value maps (no independent evidence)
Forward citations
Cited by 26 Pith papers
- CoRAL: Contact-Rich Adaptive LLM-based Control for Robotic Manipulation
  CoRAL lets LLMs act as adaptive cost designers for motion planners while using VLM priors and online identification to handle unknown physics, achieving over 50% higher success rates than baselines in unseen contact-r...
- PhysCodeBench: Benchmarking Physics-Aware Symbolic Simulation of 3D Scenes via Self-Corrective Multi-Agent Refinement
  PhysCodeBench benchmark and SMRF multi-agent framework enable better AI generation of physically accurate 3D simulation code, boosting performance by 31 points over baselines.
- Watching Movies Like a Human: Egocentric Emotion Understanding for Embodied Companions
  Creates the first egocentric screen-view movie emotion benchmark and demonstrates that cinematic models drop sharply in Macro-F1 on realistic robot-like viewing conditions while domain-specific training improves robustness.
- How Far Are Large Multimodal Models from Human-Level Spatial Action? A Benchmark for Goal-Oriented Embodied Navigation in Urban Airspace
  Large multimodal models display emerging but limited spatial action capabilities in goal-oriented urban 3D navigation, remaining far from human-level performance with errors diverging rapidly after critical decision points.
- Harnessing Embodied Agents: Runtime Governance for Policy-Constrained Execution
  A runtime governance framework for embodied agents achieves 96.2% interception of unauthorized actions and 91.4% recovery success in 1000 simulation trials by externalizing policy enforcement.
- Referring-Aware Visuomotor Policy Learning for Closed-Loop Manipulation
  ReV is a referring-aware visuomotor policy using coupled diffusion heads for real-time trajectory replanning in robotic manipulation, trained solely via targeted perturbations to expert demonstrations and achieving hi...
- Open X-Embodiment: Robotic Learning Datasets and RT-X Models
  A collaborative dataset spanning 22 robots and 527 skills enables RT-X models that transfer capabilities across different robot embodiments.
- RoboEvolve: Co-Evolving Planner-Simulator for Robotic Manipulation with Limited Data
  A co-evolutionary VLM-VGM loop on 500 unlabeled images raises planner success by 30 points and simulator success by 48 percent while beating fully supervised baselines.
- From Reaction to Anticipation: Proactive Failure Recovery through Agentic Task Graph for Robotic Manipulation
  AgentChord models manipulation tasks as directed graphs enriched with anticipatory recovery branches, using specialized agents to enable immediate, low-latency failure responses and improve success on long-horizon bim...
- Decompose and Recompose: Reasoning New Skills from Existing Abilities for Cross-Task Robotic Manipulation
  Decompose and Recompose decomposes seen robotic demonstrations into skill-action alignments and recomposes them via visual-semantic retrieval and planning to enable zero-shot cross-task generalization.
- Odysseus: Scaling VLMs to 100+ Turn Decision-Making in Games via Reinforcement Learning
  Odysseus adapts PPO with a turn-level critic and leverages pretrained VLM action priors to train agents achieving at least 3x average game progress over frontier models in long-horizon Super Mario Land.
- An LLM-Driven Closed-Loop Autonomous Learning Framework for Robots Facing Uncovered Tasks in Open Environments
  Robots autonomously convert LLM-guided experiences into a reusable local method library, reducing average execution time from 7.7772s to 6.7779s and LLM calls per task from 1.0 to 0.2 in repeated-task experiments.
- Intent-aligned Autonomous Spacecraft Guidance via Reasoning Models
  A pipeline links foundation-model intent reasoning to safe trajectory optimization via behavior sequences and waypoint constraints, achieving over 90% convergence and 1.5x better intent satisfaction in close-proximity tests.
- LAMP: Lift Image-Editing as General 3D Priors for Open-world Manipulation
  LAMP extracts continuous 3D inter-object transformations from image editing to serve as geometry-aware priors for zero-shot open-world robotic manipulation.
- World Action Models are Zero-shot Policies
  DreamZero uses a 14B video diffusion model as a World Action Model to achieve over 2x better zero-shot generalization on real robots than state-of-the-art VLAs, real-time 7Hz closed-loop control, and cross-embodiment ...
- Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success
  OpenVLA-OFT fine-tuning boosts LIBERO success rate from 76.5% to 97.1%, speeds action generation 26x, and outperforms baselines on real bimanual dexterous tasks.
- Octo: An Open-Source Generalist Robot Policy
  Octo is an open-source transformer-based generalist robot policy pretrained on 800k trajectories that serves as an effective initialization for finetuning across diverse robotic platforms.
- DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models
  DriveVLM adds vision-language models with scene description, analysis, and hierarchical planning modules to autonomous driving, paired with a hybrid DriveVLM-Dual system tested on nuScenes and SUP-AD datasets and depl...
- BioProVLA-Agent: An Affordable, Protocol-Driven, Vision-Enhanced VLA-Enabled Embodied Multi-Agent System with Closed-Loop-Capable Reasoning for Biological Laboratory Manipulation
  BioProVLA-Agent integrates protocol parsing, visual state verification, and VLA-based execution in a closed-loop multi-agent framework with AugSmolVLA augmentation to improve robustness for biological lab tasks like t...
- CoRAL: Contact-Rich Adaptive LLM-based Control for Robotic Manipulation
  CoRAL lets LLMs design objective functions for robot motion planners and uses vision-language models plus real-time identification to adapt to unknown physical properties, raising success rates by over 50 percent on n...
- Visibility-Aware Mobile Grasping in Dynamic Environments
  A visibility-aware mobile grasping system with iterative whole-body planning and behavior-tree subgoal generation achieves 68.8% success in unknown static and 58% in dynamic environments, outperforming a baseline by 2...
- AnyUser: Translating Sketched User Intent into Domestic Robots
  AnyUser translates free-form sketches on images plus optional language into executable robot actions for domestic tasks using multimodal fusion and a hierarchical policy.
- ROSClaw: A Hierarchical Semantic-Physical Framework for Heterogeneous Multi-Agent Collaboration
  ROSClaw is a hierarchical framework that unifies vision-language model control with e-URDF-based sim-to-real mapping and closed-loop data collection to enable semantic-physical collaboration among heterogeneous multi-...
- Visibility-Aware Mobile Grasping in Dynamic Environments
  A unified visibility-aware mobile grasping system using whole-body planning, active perception, and behavior trees improves success rates in unknown static and dynamic environments.
- Vision-Language-Action Safety: Threats, Challenges, Evaluations, and Mechanisms
  A literature survey that unifies fragmented work on attacks, defenses, evaluations, and deployment challenges for Vision-Language-Action models in robotics.
- XEmbodied: A Foundation Model with Enhanced Geometric and Physical Cues for Large-Scale Embodied Environments
  XEmbodied is a foundation model that integrates 3D geometric and physical signals into VLMs using a 3D Adapter and Efficient Image-Embodied Adapter, plus progressive curriculum and RL post-training, to improve spatial...
discussion (0)