Recognition: no theorem link
Inner Monologue: Embodied Reasoning through Planning with Language Models
Pith reviewed 2026-05-11 20:04 UTC · model grok-4.3
The pith
Language models improve robotic planning by maintaining an inner monologue of natural language feedback from the environment.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By treating environment feedback as additional natural-language context, large language models can sustain an inner monologue that lets them revise plans in response to the outcomes of their own actions, producing measurably higher success on instruction-following tasks in both simulation and the real world.
What carries the argument
The inner monologue: an iterative loop in which the language model receives natural-language feedback from the environment (success detection, scene descriptions, or human input), reasons over it, and updates its next plan accordingly.
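A minimal sketch of how such a loop could be wired up, assuming hypothetical interfaces (llm, robot.execute, robot.detect_success, robot.describe_scene are stand-in names, not the paper's actual API):

```python
# Hedged sketch of a closed-loop inner monologue. Every interface here
# (llm, robot.execute, robot.detect_success, robot.describe_scene) is a
# hypothetical stand-in, not the paper's actual implementation.

def inner_monologue(instruction, llm, robot, max_steps=20):
    """Plan step by step, feeding verbalized feedback back into the prompt."""
    history = [f"Human: {instruction}"]
    for _ in range(max_steps):
        # The model sees the whole dialogue so far and names the next skill.
        prompt = "\n".join(history) + "\nRobot:"
        action = llm(prompt).strip()          # e.g. "pick up the sponge"
        if action.lower() == "done":
            return True, history
        history.append(f"Robot: {action}")
        robot.execute(action)
        # Environment outcomes are rendered as language and appended, so the
        # next planning call reasons over them. No weights are updated.
        history.append(f"Success: {robot.detect_success(action)}")
        history.append(f"Scene: {robot.describe_scene()}")
    return False, history
```

Because feedback enters only as appended text, adding or removing a feedback source is a prompt change rather than a retraining step, which is what makes the combination claim below cheap to test.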
If this is right
- Closed-loop language feedback raises completion rates on both simulated and physical tabletop rearrangement.
- The same feedback loop improves long-horizon mobile manipulation in a real kitchen.
- Multiple language feedback sources can be combined without retraining the underlying model.
- Plans adapt dynamically as the world state changes, because the model re-reasons over updated language descriptions.
Where Pith is reading between the lines
- The approach could transfer to other manipulation or navigation domains if their outcomes can be summarized in language.
- Performance may drop in settings where feedback is noisy or incomplete, suggesting a need for verification steps not tested here.
- Combining the monologue with direct visual or proprioceptive inputs might increase robustness beyond what language alone provides.
- The method implies that future robot systems could rely more on general-purpose language models and less on domain-specific fine-tuning.
Load-bearing premise
Large language models can reliably interpret and act on natural language descriptions of task outcomes and world state without any task-specific training.
What would settle it
Re-run the same tabletop and kitchen tasks while removing or corrupting the language feedback channel and measure whether the reported gains in instruction completion disappear.
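A rough sketch of that ablation, reusing the hypothetical inner_monologue loop sketched earlier; the corruption wrapper and the specific error rates are illustrative, not taken from the paper:

```python
import random

def corrupted(detect_success, error_rate):
    """Wrap a success detector so its boolean verdict flips with
    probability error_rate, simulating a noisy feedback channel."""
    def noisy(action):
        verdict = detect_success(action)
        return (not verdict) if random.random() < error_rate else verdict
    return noisy

def feedback_ablation(run_episode, n_trials=50):
    """run_episode(error_rate) -> bool is assumed to run one task with the
    success channel wrapped by corrupted(...) at that rate; removing the
    channel entirely corresponds to dropping the Success/Scene lines from
    the prompt history instead of corrupting them."""
    for error_rate in (0.0, 0.1, 0.2, 0.3):
        wins = sum(run_episode(error_rate) for _ in range(n_trials))
        print(f"error_rate={error_rate:.1f}  completion={wins / n_trials:.2f}")
```

If completion rates at error_rate=0.0 match the paper's closed-loop numbers but fall toward the open-loop baseline as the rate grows, the language feedback channel was carrying the reported gains.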
read the original abstract
Recent works have shown how the reasoning capabilities of Large Language Models (LLMs) can be applied to domains beyond natural language processing, such as planning and interaction for robots. These embodied problems require an agent to understand many semantic aspects of the world: the repertoire of skills available, how these skills influence the world, and how changes to the world map back to the language. LLMs planning in embodied environments need to consider not just what skills to do, but also how and when to do them - answers that change over time in response to the agent's own choices. In this work, we investigate to what extent LLMs used in such embodied contexts can reason over sources of feedback provided through natural language, without any additional training. We propose that by leveraging environment feedback, LLMs are able to form an inner monologue that allows them to more richly process and plan in robotic control scenarios. We investigate a variety of sources of feedback, such as success detection, scene description, and human interaction. We find that closed-loop language feedback significantly improves high-level instruction completion on three domains, including simulated and real table top rearrangement tasks and long-horizon mobile manipulation tasks in a kitchen environment in the real world.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes that Large Language Models can perform embodied reasoning for robotic planning by incorporating natural language feedback from the environment into an 'inner monologue' without any additional training. The approach is evaluated on three domains: simulated and real tabletop rearrangement tasks, and real-world long-horizon mobile manipulation in a kitchen setting. The central finding is that closed-loop language feedback from sources like success detection, scene description, and human interaction significantly improves high-level instruction completion rates compared to open-loop baselines.
Significance. If the empirical results are robust, this work demonstrates a promising direction for using off-the-shelf LLMs in dynamic robotic control scenarios, leveraging their reasoning capabilities over language-based feedback to handle uncertainty and changes in the environment. It provides evidence across both simulation and real hardware, which strengthens the claim for practical applicability in robotics.
major comments (2)
- The evaluation does not include ablations or stress tests where the feedback modules (success detection, scene description) are perturbed with realistic error rates or partial information. This is load-bearing because the headline improvement relies on the assumption that these external modules provide accurate and complete language feedback that the LLM can effectively reason over.
- Details on the exact prompt construction for integrating multiple feedback sources and how the LLM generates the next action or plan are not sufficiently specified, making it difficult to assess the precise mechanism of the 'inner monologue' or to reproduce the results.
minor comments (1)
- The abstract claims 'significant improvements' but provides no quantitative metrics, baselines, or error bars, which should be summarized even at a high level.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback. We appreciate the opportunity to clarify and strengthen the manuscript. We address each major comment below and have revised the paper accordingly to improve clarity and robustness.
read point-by-point responses
Referee: The evaluation does not include ablations or stress tests where the feedback modules (success detection, scene description) are perturbed with realistic error rates or partial information. This is load-bearing because the headline improvement relies on the assumption that these external modules provide accurate and complete language feedback that the LLM can effectively reason over.
Authors: We agree that controlled stress tests with perturbed feedback would strengthen claims about robustness. Our real-world experiments already use feedback from imperfect perception systems (e.g., object detectors with errors and human-provided scene descriptions), and the inner monologue still yields substantial gains over open-loop baselines in these noisy settings. However, we did not include explicit ablations with injected error rates. In the revised manuscript, we have added a new analysis subsection with simulation results that systematically vary success detection accuracy (0-30% error) and scene description completeness, showing graceful degradation and retained benefits from language-based reasoning up to moderate noise levels. We also discuss failure modes when feedback becomes highly unreliable. revision: yes
Referee: Details on the exact prompt construction for integrating multiple feedback sources and how the LLM generates the next action or plan are not sufficiently specified, making it difficult to assess the precise mechanism of the 'inner monologue' or to reproduce the results.
Authors: We acknowledge that the original description of prompt formatting was high-level. The revised manuscript now includes an expanded appendix with the complete prompt templates for each domain and feedback combination. These templates show the exact structure for concatenating success detection outputs, scene descriptions, and human feedback into the LLM context, along with the system instructions and few-shot examples used. We also provide full example inner-monologue traces (input history and LLM-generated plans) for representative episodes, making the closed-loop reasoning process fully reproducible. revision: yes
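The actual templates live in the revised appendix and are not reproduced here; purely as illustration, a combined-feedback prompt could be assembled along these lines (field names and example strings are invented, not the paper's):

```python
# Illustrative only: an invented layout for concatenating the three
# feedback sources into a single LLM context window.
PROMPT = """\
Task: {instruction}

{few_shot_examples}

Scene: {scene_description}
Robot: {previous_action}
Success: {success_feedback}
Human: {human_feedback}
Robot:"""

prompt = PROMPT.format(
    instruction="put all the blocks in the bowl",
    few_shot_examples="<worked example episodes shown to the model>",
    scene_description="objects: red block, blue block, green bowl",
    previous_action="pick up the red block",
    success_feedback="False",
    human_feedback="the block slipped out of the gripper",
)
```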
Circularity Check
No circularity: empirical evaluation of LLM planning with external feedback
full rationale
The paper presents an empirical study of using frozen LLMs for robotic planning augmented by language feedback from separate modules (success detection, scene description, human input). It reports performance gains on rearrangement and mobile manipulation tasks but contains no derivation chain, no equations, no fitted parameters renamed as predictions, and no load-bearing uniqueness claims resting on self-citation. All results are obtained by direct experimentation on simulated and real hardware; the central claim is therefore not equivalent to its inputs by construction and remains falsifiable by the reported baselines and ablations.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: LLMs possess reasoning capabilities applicable to planning and interaction in embodied environments
- domain assumption: Natural language feedback from the environment can be effectively processed by LLMs to adjust plans
Forward citations
Cited by 42 Pith papers
- Tree of Thoughts: Deliberate Problem Solving with Large Language Models · Tree of Thoughts enables language models to solve complex planning tasks by generating, evaluating, and searching over coherent intermediate thoughts in a tree, raising Game of 24 success from 4% to 74% with GPT-4.
- Generative Agents: Interactive Simulacra of Human Behavior · Generative agents with memory streams, reflection, and planning using LLMs exhibit believable individual and emergent social behaviors in a simulated town.
- State-Centric Decision Process · SDP constructs a task-induced state space from raw text by having agents commit to and certify natural-language predicates as states, enabling structured planning and analysis in unstructured language environments.
- Action Agent: Agentic Video Generation Meets Flow-Constrained Diffusion · Action Agent pairs LLM-driven video generation with a flow-constrained diffusion transformer to produce velocity commands, raising video success to 86% and delivering 64.7% real-world navigation on a Unitree G1 humanoid.
- Using large language models for embodied planning introduces systematic safety risks · LLM planners for robots often produce dangerous plans even when planning succeeds, with safety awareness staying flat as model scale improves planning ability.
- ADAPT: Benchmarking Commonsense Planning under Unspecified Affordance Constraints · ADAPT augments planners with affordance reasoning to raise task success in environments with unspecified and time-varying object affordances, and a LoRA-finetuned VLM backend beats GPT-4o on the new DynAfford benchmark.
- Governed Capability Evolution: Lifecycle-Time Compatibility Checking and Rollback for AI-Component-Based Systems, with Embodied Agents as Case Study · A governed capability evolution framework with interface, policy, behavioral, and recovery checks reduces unsafe activations to zero in embodied agent upgrades while preserving task success rates.
- A Full-Stack Performance Evaluation Infrastructure for 3D-DRAM-based LLM Accelerators · ATLAS is the first silicon-validated simulation framework for 3D-DRAM LLM accelerators, achieving under 8.57% error and over 97% correlation with real hardware while supporting design exploration.
- Harnessing Embodied Agents: Runtime Governance for Policy-Constrained Execution · A runtime governance framework for embodied agents achieves 96.2% interception of unauthorized actions and 91.4% recovery success in 1000 simulation trials by externalizing policy enforcement.
- VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models · VoxPoser uses LLMs to compose 3D value maps via VLM interaction for model-based synthesis of robust robot trajectories on open-set language-specified manipulation tasks.
- Voyager: An Open-Ended Embodied Agent with Large Language Models · Voyager achieves superior lifelong learning in Minecraft by combining an automatic exploration curriculum, a library of executable skills, and iterative LLM prompting with environment feedback, yielding 3.3x more uniq...
- LLM+P: Empowering Large Language Models with Optimal Planning Proficiency · LLM+P lets LLMs solve planning problems optimally by converting them to PDDL for classical planners and back to natural language.
- Do As I Can, Not As I Say: Grounding Language in Robotic Affordances · SayCan combines an LLM's high-level semantic knowledge with robot skill value functions to select only feasible actions, enabling completion of abstract natural-language instructions on a real mobile manipulator.
- Towards Long-horizon Embodied Agents with Tool-Aligned Vision-Language-Action Models · VLAs-as-Tools pairs a VLM planner with specialized VLA executors via a new interface and Tool-Aligned Post-Training to raise long-horizon robot success rates on LIBERO-Long and RoboTwin benchmarks.
- From Reaction to Anticipation: Proactive Failure Recovery through Agentic Task Graph for Robotic Manipulation · AgentChord models manipulation tasks as directed graphs enriched with anticipatory recovery branches, using specialized agents to enable immediate, low-latency failure responses and improve success on long-horizon bim...
- From Controlled to the Wild: Evaluation of Pentesting Agents for the Real-World · A practical evaluation protocol for AI pentesting agents that uses validated vulnerability discovery, LLM semantic matching, and bipartite scoring to assess performance in realistic, complex targets.
- RePO-VLA: Recovery-Driven Policy Optimization for Vision-Language-Action Models · RePO-VLA raises average adversarial success rates in VLA manipulation from 20% to 75% by using recovery-aware initialization, a progress-aware semantic value function, and value-conditioned refinement on success and c...
- OPT-BENCH: Evaluating the Iterative Self-Optimization of LLM Agents in Large-Scale Search Spaces · OPT-BENCH and OPT-Agent evaluate LLM self-optimization in large search spaces, showing stronger models improve via feedback but stay constrained by base capacity and below human performance.
- 2.5-D Decomposition for LLM-Based Spatial Construction · 2.5-D decomposition lets LLMs achieve 94.6% structural accuracy on a building benchmark by handling only horizontal planning while a symbolic system manages vertical placements from occupancy.
- Milestone-Guided Policy Learning for Long-Horizon Language Agents · BEACON uses milestone partitioning, temporal reward shaping, and dual-scale advantage estimation to nearly double success rates on long-horizon ALFWorld tasks while raising effective sample use from 23.7% to 82%.
- LoopTrap: Termination Poisoning Attacks on LLM Agents · LoopTrap is an automated red-teaming framework that crafts termination-poisoning prompts to amplify LLM agent steps by 3.57x on average (up to 25x) across 8 agents.
- How Far Are VLMs from Privacy Awareness in the Physical World? An Empirical Study · Vision-language models exhibit perceptual fragility and fail to consistently respect privacy constraints when operating in simulated physical environments, with performance declining in cluttered scenes and under conf...
- How Far Are VLMs from Privacy Awareness in the Physical World? An Empirical Study · VLMs show consistent deficits in identifying sensitive items in cluttered scenes, adapting to social contexts, and resolving conflicts between commands and privacy constraints in a new physical simulator benchmark.
- Stable Agentic Control: Tool-Mediated LLM Architecture for Autonomous Cyber Defense · Tool-mediated LLM agents with deterministic tools and a machine-checked Lyapunov certificate achieve stable control in cyber defense, reducing attacker game value by 59% on real attack graphs.
- FlexSQL: Flexible Exploration and Execution Make Better Text-to-SQL Agents · FlexSQL reaches 65.4% on Spider2-Snow by allowing agents to flexibly explore schemas, generate diverse plans, choose SQL or Python execution, and apply two-tiered repair.
- Efficient Rationale-based Retrieval: On-policy Distillation from Generative Rerankers based on JEPA · Rabtriever distills a generative reranker into an efficient bi-encoder using on-policy JEPA to achieve near-reranker accuracy with linear complexity on rationale-based retrieval.
- Efficient Rationale-based Retrieval: On-policy Distillation from Generative Rerankers based on JEPA · Rabtriever distills a generative reranker into an efficient independent encoder using JEPA and auxiliary reverse KL loss to achieve linear complexity and strong performance on rationale-based retrieval tasks.
- Long-Horizon Manipulation via Trace-Conditioned VLA Planning · LoHo-Manip enables robust long-horizon robot manipulation by using a receding-horizon VLM manager to output progress-aware subtask sequences and 2D visual traces that condition a VLA executor for automatic replanning.
- Navigating the Clutter: Waypoint-Based Bi-Level Planning for Multi-Robot Systems · Waypoint-based bi-level planning with curriculum RLVR improves multi-robot task success rates in dense-obstacle benchmarks over motion-agnostic and VLA baselines.
- HELM: Harness-Enhanced Long-horizon Memory for Vision-Language-Action Manipulation · HELM raises long-horizon VLA success from 58.4% to 81.5% on LIBERO-LONG by combining episodic memory retrieval, learned failure prediction, and replanning, outperforming context extension or adaptation alone.
- From Perception to Planning: Evolving Ego-Centric Task-Oriented Spatiotemporal Reasoning via Curriculum Learning · EgoTSR applies a three-stage curriculum on a 46-million-sample dataset to build egocentric spatiotemporal reasoning, reaching 92.4% accuracy on long-horizon tasks and reducing chronological biases.
- Governed Capability Evolution: Lifecycle-Time Compatibility Checking and Rollback for AI-Component-Based Systems, with Embodied Agents as Case Study · A governed capability evolution framework for embodied agents uses four compatibility checks and a staged pipeline to achieve zero unsafe activations during upgrades while retaining comparable task success rates.
- An Edge-Host-Cloud Architecture for Robot-Agnostic, Caregiver-in-the-Loop Personalized Cognitive Exercise: Multi-Site Deployment in Dementia Care · Speaking Memories is a robot-agnostic edge-host-cloud architecture for caregiver-in-the-loop personalized cognitive exercise in dementia care, achieving sub-6-second latency and positive stakeholder feedback in multi-...
- Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success · OpenVLA-OFT fine-tuning boosts LIBERO success rate from 76.5% to 97.1%, speeds action generation 26x, and outperforms baselines on real bimanual dexterous tasks.
- Unleashing Large-Scale Video Generative Pre-training for Visual Robot Manipulation · A GPT-style model pre-trained on large video datasets achieves 94.9% success on CALVIN multi-task manipulation and 85.4% zero-shot generalization, outperforming prior baselines.
- CAMEL: Communicative Agents for "Mind" Exploration of Large Language Model Society · CAMEL proposes a role-playing framework with inception prompting that enables autonomous multi-agent cooperation among LLMs and generates conversational data for studying their behaviors.
- Large Language Models for Sequential Decision-Making: Improving In-Context Learning via Supervised Fine-Tuning · Supervised fine-tuning of pretrained LLMs on offline trajectories yields better few-shot sequential decision-making than in-context-only baselines, with a theoretical suboptimality bound derived for linear MDPs by int...
- Evaluating Generative Models as Interactive Emergent Representations of Human-Like Collaborative Behavior · Embodied LLM agents exhibit emergent collaborative behaviors indicating mental models of partners in a color-matching game, detected via LLM judges and supported by positive user feedback.
- Evaluating Generative Models as Interactive Emergent Representations of Human-Like Collaborative Behavior · LLM agents in a collaborative 2D game exhibit emergent behaviors such as perspective-taking, theory of mind, and clarification, detected by LLM judges and rated positively by human participants.
- Pre-Execution Safety Gate & Task Safety Contracts for LLM-Controlled Robot Systems · SafeGate adds a deterministic pre-execution gate and runtime contracts with Z3 SMT solving to block unsafe LLM commands for robots while passing safe ones.
- ABot-Claw: A Foundation for Persistent, Cooperative, and Self-Evolving Robotic Agents · ABot-Claw is an embodied software layer that adds unified robot scheduling, cross-embodiment visual memory, and critic-driven replanning on top of OpenClaw to support persistent multi-robot execution from natural-lang...
- Large Language Models: A Survey · The paper surveys key large language models, their training methods, datasets, evaluation benchmarks, and future research directions in the field.