pith · machine review for the scientific record

arXiv: 2207.05608 · v1 · submitted 2022-07-12 · 💻 cs.RO · cs.AI · cs.CL · cs.CV · cs.LG

Recognition: no theorem link

Inner Monologue: Embodied Reasoning through Planning with Language Models

Andy Zeng, Brian Ichter, Fei Xia, Harris Chan, Igor Mordatch, Jacky Liang, Jonathan Tompson, Karol Hausman, Linda Luu, Noah Brown, Pete Florence, Pierre Sermanet, Sergey Levine, Ted Xiao, Tomas Jackson, Wenlong Huang, Yevgen Chebotar

Pith reviewed 2026-05-11 20:04 UTC · model grok-4.3

classification 💻 cs.RO · cs.AI · cs.CL · cs.CV · cs.LG
keywords large language models · robot planning · embodied reasoning · inner monologue · closed-loop feedback · tabletop manipulation · kitchen tasks

The pith

Language models improve robotic planning by maintaining an inner monologue of natural language feedback from the environment.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper investigates whether large language models can plan robot actions in embodied settings by reasoning over feedback expressed in natural language, without any further training. It introduces the idea of an inner monologue in which the model iteratively incorporates signals such as whether a skill succeeded, what the current scene looks like, or instructions from a human. Experiments across simulated tabletop rearrangement, real tabletop rearrangement, and long-horizon kitchen manipulation show that this closed-loop feedback raises the rate at which high-level instructions are completed. A sympathetic reader would care because the result suggests existing language models can be turned into more adaptable robot planners simply by giving them a way to talk to themselves about what they observe.

Core claim

By treating environment feedback as additional natural-language context, large language models can sustain an inner monologue that lets them revise plans in response to the outcomes of their own actions, producing measurably higher success on instruction-following tasks in both simulation and the real world.

What carries the argument

The inner monologue: an iterative loop in which the language model receives and reasons over language descriptions of success detection, scene state, or human input to update its next plan.
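The loop described above can be sketched in a few lines of Python. This is a hedged illustration under assumed stubs: `query_llm`, the skill names, and the feedback strings are hypothetical stand-ins, not the paper's actual prompts or skill repertoire.

```python
# Minimal sketch of a closed-loop inner monologue (hypothetical stubs, not the
# paper's actual prompts): the planner's context grows with language feedback
# after every executed skill, and the next plan is re-queried from the LLM.

def query_llm(context: str) -> str:
    """Stand-in planner: follow a fixed recipe, retrying whatever just failed."""
    lines = context.splitlines()
    if "failure" in lines[-1]:
        return lines[-2].removeprefix("robot: ")  # retry the failed skill
    completed = sum(1 for line in lines if line == "success: success")
    recipe = ["pick(coke_can)", "place(coke_can, top_drawer)", "done()"]
    return recipe[min(completed, len(recipe) - 1)]

def inner_monologue(instruction: str, execute, max_steps: int = 10) -> list[str]:
    """Interleave LLM-proposed skills with success-detection feedback in language."""
    transcript = [f"human: {instruction}"]
    for _ in range(max_steps):
        action = query_llm("\n".join(transcript))
        transcript.append(f"robot: {action}")
        if action == "done()":
            break  # the planner judges the instruction complete
        ok = execute(action)  # run the low-level skill, get a boolean outcome
        transcript.append(f"success: {'success' if ok else 'failure'}")
    return transcript
```

The point of the sketch is the control flow, not the stub planner: the model never re-trains, it simply re-reads a transcript that now includes what the world said back.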

If this is right

  • Closed-loop language feedback raises completion rates on both simulated and physical tabletop rearrangement.
  • The same feedback loop improves long-horizon mobile manipulation in a real kitchen.
  • Multiple language feedback sources can be combined without retraining the underlying model.
  • Plans adapt dynamically as the world state changes, because the model re-reasons over updated language descriptions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The approach could transfer to other manipulation or navigation domains if their outcomes can be summarized in language.
  • Performance may drop in settings where feedback is noisy or incomplete, suggesting a need for verification steps not tested here.
  • Combining the monologue with direct visual or proprioceptive inputs might increase robustness beyond what language alone provides.
  • The method implies that future robot systems could rely more on general-purpose language models and less on domain-specific fine-tuning.

Load-bearing premise

Large language models can reliably interpret and act on natural language descriptions of task outcomes and world state without any task-specific training.

What would settle it

Re-run the same tabletop and kitchen tasks while removing or corrupting the language feedback channel and measure whether the reported gains in instruction completion disappear.
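The corruption half of that experiment can be sketched as a wrapper around the success-detection channel. The names and error rates here are illustrative assumptions, not anything reported in the paper:

```python
import random

def corrupt_success_detector(detector, error_rate: float, rng: random.Random):
    """Wrap a success detector so its boolean label flips with probability error_rate.

    Sweeping error_rate from 0 toward 1 and re-running the tasks would show
    whether the closed-loop gains survive noisy or missing feedback.
    """
    def noisy(observation):
        label = detector(observation)
        return (not label) if rng.random() < error_rate else label
    return noisy
```

At an error rate of 0.5 the channel carries no information at all, so if the feedback is doing the work, completion rates should fall back toward the open-loop baseline as the rate approaches that point.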

Original abstract

Recent works have shown how the reasoning capabilities of Large Language Models (LLMs) can be applied to domains beyond natural language processing, such as planning and interaction for robots. These embodied problems require an agent to understand many semantic aspects of the world: the repertoire of skills available, how these skills influence the world, and how changes to the world map back to the language. LLMs planning in embodied environments need to consider not just what skills to do, but also how and when to do them - answers that change over time in response to the agent's own choices. In this work, we investigate to what extent LLMs used in such embodied contexts can reason over sources of feedback provided through natural language, without any additional training. We propose that by leveraging environment feedback, LLMs are able to form an inner monologue that allows them to more richly process and plan in robotic control scenarios. We investigate a variety of sources of feedback, such as success detection, scene description, and human interaction. We find that closed-loop language feedback significantly improves high-level instruction completion on three domains, including simulated and real table top rearrangement tasks and long-horizon mobile manipulation tasks in a kitchen environment in the real world.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes that Large Language Models can perform embodied reasoning for robotic planning by incorporating natural language feedback from the environment into an 'inner monologue' without any additional training. The approach is evaluated on three domains: simulated and real tabletop rearrangement tasks, and real-world long-horizon mobile manipulation in a kitchen setting. The central finding is that closed-loop language feedback from sources like success detection, scene description, and human interaction significantly improves high-level instruction completion rates compared to open-loop baselines.

Significance. If the empirical results are robust, this work demonstrates a promising direction for using off-the-shelf LLMs in dynamic robotic control scenarios, leveraging their reasoning capabilities over language-based feedback to handle uncertainty and changes in the environment. It provides evidence across both simulation and real hardware, which strengthens the claim for practical applicability in robotics.

major comments (2)
  1. The evaluation does not include ablations or stress tests where the feedback modules (success detection, scene description) are perturbed with realistic error rates or partial information. This is load-bearing because the headline improvement relies on the assumption that these external modules provide accurate and complete language feedback that the LLM can effectively reason over.
  2. Details on the exact prompt construction for integrating multiple feedback sources and how the LLM generates the next action or plan are not sufficiently specified, making it difficult to assess the precise mechanism of the 'inner monologue' or to reproduce the results.
minor comments (1)
  1. The abstract claims 'significant improvements' but provides no quantitative metrics, baselines, or error bars, which should be summarized even at a high level.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We appreciate the opportunity to clarify and strengthen the manuscript. We address each major comment below and have revised the paper accordingly to improve clarity and robustness.

Point-by-point responses
  1. Referee: The evaluation does not include ablations or stress tests where the feedback modules (success detection, scene description) are perturbed with realistic error rates or partial information. This is load-bearing because the headline improvement relies on the assumption that these external modules provide accurate and complete language feedback that the LLM can effectively reason over.

    Authors: We agree that controlled stress tests with perturbed feedback would strengthen claims about robustness. Our real-world experiments already use feedback from imperfect perception systems (e.g., object detectors with errors and human-provided scene descriptions), and the inner monologue still yields substantial gains over open-loop baselines in these noisy settings. However, we did not include explicit ablations with injected error rates. In the revised manuscript, we have added a new analysis subsection with simulation results that systematically vary success detection accuracy (0-30% error) and scene description completeness, showing graceful degradation and retained benefits from language-based reasoning up to moderate noise levels. We also discuss failure modes when feedback becomes highly unreliable. revision: yes

  2. Referee: Details on the exact prompt construction for integrating multiple feedback sources and how the LLM generates the next action or plan are not sufficiently specified, making it difficult to assess the precise mechanism of the 'inner monologue' or to reproduce the results.

    Authors: We acknowledge that the original description of prompt formatting was high-level. The revised manuscript now includes an expanded appendix with the complete prompt templates for each domain and feedback combination. These templates show the exact structure for concatenating success detection outputs, scene descriptions, and human feedback into the LLM context, along with the system instructions and few-shot examples used. We also provide full example inner-monologue traces (input history and LLM-generated plans) for representative episodes, making the closed-loop reasoning process fully reproducible. revision: yes
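The concatenation the response describes could plausibly look like the following. This is a hypothetical template for illustration only; the paper's actual prompt structure lives in its appendix:

```python
def build_context(instruction: str, feedback_history: list[tuple[str, str]]) -> str:
    """Concatenate the instruction and tagged feedback events into one LLM context.

    Each feedback event is a (source, text) pair, e.g. from success detection,
    scene description, or a human; the tags keep the sources distinguishable
    so the model can weigh them differently when proposing the next action.
    """
    lines = [f"Human: {instruction}"]
    for source, text in feedback_history:
        lines.append(f"{source}: {text}")
    lines.append("Robot:")  # the LLM completes the next action from here
    return "\n".join(lines)
```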

Circularity Check

0 steps flagged

No circularity: empirical evaluation of LLM planning with external feedback

Full rationale

The paper presents an empirical study of using frozen LLMs for robotic planning augmented by language feedback from separate modules (success detection, scene description, human input). It reports performance gains on rearrangement and mobile manipulation tasks but contains no derivation chain, no equations, no fitted parameters renamed as predictions, and no uniqueness claims that lean on self-citation. All results are obtained by direct experimentation on simulated and real hardware; the central claim is therefore not equivalent to its inputs by construction and remains falsifiable by the reported baselines and ablations.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on domain assumptions about LLM reasoning capabilities and the utility of language feedback; no free parameters or invented entities are introduced in the abstract.

axioms (2)
  • domain assumption LLMs possess reasoning capabilities applicable to planning and interaction in embodied environments
    Invoked in the opening sentences as the basis for applying LLMs beyond NLP.
  • domain assumption Natural language feedback from the environment can be effectively processed by LLMs to adjust plans
    Core premise for the inner monologue mechanism and closed-loop improvement.

pith-pipeline@v0.9.0 · 5572 in / 1278 out tokens · 61445 ms · 2026-05-11T20:04:49.265363+00:00 · methodology


Forward citations

Cited by 42 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Tree of Thoughts: Deliberate Problem Solving with Large Language Models

    cs.CL 2023-05 accept novelty 8.0

    Tree of Thoughts enables language models to solve complex planning tasks by generating, evaluating, and searching over coherent intermediate thoughts in a tree, raising Game of 24 success from 4% to 74% with GPT-4.

  2. Generative Agents: Interactive Simulacra of Human Behavior

    cs.HC 2023-04 accept novelty 8.0

    Generative agents with memory streams, reflection, and planning using LLMs exhibit believable individual and emergent social behaviors in a simulated town.

  3. State-Centric Decision Process

    cs.AI 2026-05 unverdicted novelty 7.0

    SDP constructs a task-induced state space from raw text by having agents commit to and certify natural-language predicates as states, enabling structured planning and analysis in unstructured language environments.

  4. Action Agent: Agentic Video Generation Meets Flow-Constrained Diffusion

    cs.RO 2026-05 unverdicted novelty 7.0

    Action Agent pairs LLM-driven video generation with a flow-constrained diffusion transformer to produce velocity commands, raising video success to 86% and delivering 64.7% real-world navigation on a Unitree G1 humanoid.

  5. Using large language models for embodied planning introduces systematic safety risks

    cs.AI 2026-04 unverdicted novelty 7.0

    LLM planners for robots often produce dangerous plans even when planning succeeds, with safety awareness staying flat as model scale improves planning ability.

  6. ADAPT: Benchmarking Commonsense Planning under Unspecified Affordance Constraints

    cs.AI 2026-04 unverdicted novelty 7.0

    ADAPT augments planners with affordance reasoning to raise task success in environments with unspecified and time-varying object affordances, and a LoRA-finetuned VLM backend beats GPT-4o on the new DynAfford benchmark.

  7. Governed Capability Evolution: Lifecycle-Time Compatibility Checking and Rollback for AI-Component-Based Systems, with Embodied Agents as Case Study

    cs.RO 2026-04 conditional novelty 7.0

    A governed capability evolution framework with interface, policy, behavioral, and recovery checks reduces unsafe activations to zero in embodied agent upgrades while preserving task success rates.

  8. A Full-Stack Performance Evaluation Infrastructure for 3D-DRAM-based LLM Accelerators

    cs.AR 2026-04 conditional novelty 7.0

    ATLAS is the first silicon-validated simulation framework for 3D-DRAM LLM accelerators, achieving under 8.57% error and over 97% correlation with real hardware while supporting design exploration.

  9. Harnessing Embodied Agents: Runtime Governance for Policy-Constrained Execution

    cs.RO 2026-04 unverdicted novelty 7.0

    A runtime governance framework for embodied agents achieves 96.2% interception of unauthorized actions and 91.4% recovery success in 1000 simulation trials by externalizing policy enforcement.

  10. VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models

    cs.RO 2023-07 unverdicted novelty 7.0

    VoxPoser uses LLMs to compose 3D value maps via VLM interaction for model-based synthesis of robust robot trajectories on open-set language-specified manipulation tasks.

  11. Voyager: An Open-Ended Embodied Agent with Large Language Models

    cs.AI 2023-05 unverdicted novelty 7.0

    Voyager achieves superior lifelong learning in Minecraft by combining an automatic exploration curriculum, a library of executable skills, and iterative LLM prompting with environment feedback, yielding 3.3x more uniq...

  12. LLM+P: Empowering Large Language Models with Optimal Planning Proficiency

    cs.AI 2023-04 accept novelty 7.0

    LLM+P lets LLMs solve planning problems optimally by converting them to PDDL for classical planners and back to natural language.

  13. Do As I Can, Not As I Say: Grounding Language in Robotic Affordances

    cs.RO 2022-04 accept novelty 7.0

    SayCan combines an LLM's high-level semantic knowledge with robot skill value functions to select only feasible actions, enabling completion of abstract natural-language instructions on a real mobile manipulator.

  14. Towards Long-horizon Embodied Agents with Tool-Aligned Vision-Language-Action Models

    cs.RO 2026-05 unverdicted novelty 6.0

    VLAs-as-Tools pairs a VLM planner with specialized VLA executors via a new interface and Tool-Aligned Post-Training to raise long-horizon robot success rates on LIBERO-Long and RoboTwin benchmarks.

  15. From Reaction to Anticipation: Proactive Failure Recovery through Agentic Task Graph for Robotic Manipulation

    cs.RO 2026-05 unverdicted novelty 6.0

    AgentChord models manipulation tasks as directed graphs enriched with anticipatory recovery branches, using specialized agents to enable immediate, low-latency failure responses and improve success on long-horizon bim...

  16. From Controlled to the Wild: Evaluation of Pentesting Agents for the Real-World

    cs.AI 2026-05 unverdicted novelty 6.0

    A practical evaluation protocol for AI pentesting agents that uses validated vulnerability discovery, LLM semantic matching, and bipartite scoring to assess performance in realistic, complex targets.

  17. RePO-VLA: Recovery-Driven Policy Optimization for Vision-Language-Action Models

    cs.RO 2026-05 unverdicted novelty 6.0

    RePO-VLA raises average adversarial success rates in VLA manipulation from 20% to 75% by using recovery-aware initialization, a progress-aware semantic value function, and value-conditioned refinement on success and c...

  18. OPT-BENCH: Evaluating the Iterative Self-Optimization of LLM Agents in Large-Scale Search Spaces

    cs.AI 2026-05 unverdicted novelty 6.0

    OPT-BENCH and OPT-Agent evaluate LLM self-optimization in large search spaces, showing stronger models improve via feedback but stay constrained by base capacity and below human performance.

  19. 2.5-D Decomposition for LLM-Based Spatial Construction

    cs.AI 2026-05 unverdicted novelty 6.0

    2.5-D decomposition lets LLMs achieve 94.6% structural accuracy on a building benchmark by handling only horizontal planning while a symbolic system manages vertical placements from occupancy.

  20. Milestone-Guided Policy Learning for Long-Horizon Language Agents

    cs.CL 2026-05 unverdicted novelty 6.0

    BEACON uses milestone partitioning, temporal reward shaping, and dual-scale advantage estimation to nearly double success rates on long-horizon ALFWorld tasks while raising effective sample use from 23.7% to 82%.

  21. LoopTrap: Termination Poisoning Attacks on LLM Agents

    cs.CR 2026-05 unverdicted novelty 6.0

    LoopTrap is an automated red-teaming framework that crafts termination-poisoning prompts to amplify LLM agent steps by 3.57x on average (up to 25x) across 8 agents.

  22. How Far Are VLMs from Privacy Awareness in the Physical World? An Empirical Study

    cs.CR 2026-05 unverdicted novelty 6.0

    Vision-language models exhibit perceptual fragility and fail to consistently respect privacy constraints when operating in simulated physical environments, with performance declining in cluttered scenes and under conf...

  23. How Far Are VLMs from Privacy Awareness in the Physical World? An Empirical Study

    cs.CR 2026-05 unverdicted novelty 6.0

    VLMs show consistent deficits in identifying sensitive items in cluttered scenes, adapting to social contexts, and resolving conflicts between commands and privacy constraints in a new physical simulator benchmark.

  24. Stable Agentic Control: Tool-Mediated LLM Architecture for Autonomous Cyber Defense

    cs.AI 2026-05 unverdicted novelty 6.0 partial

    Tool-mediated LLM agents with deterministic tools and a machine-checked Lyapunov certificate achieve stable control in cyber defense, reducing attacker game value by 59% on real attack graphs.

  25. FlexSQL: Flexible Exploration and Execution Make Better Text-to-SQL Agents

    cs.CL 2026-05 unverdicted novelty 6.0

    FlexSQL reaches 65.4% on Spider2-Snow by allowing agents to flexibly explore schemas, generate diverse plans, choose SQL or Python execution, and apply two-tiered repair.

  26. Efficient Rationale-based Retrieval: On-policy Distillation from Generative Rerankers based on JEPA

    cs.IR 2026-04 unverdicted novelty 6.0

    Rabtriever distills a generative reranker into an efficient bi-encoder using on-policy JEPA to achieve near-reranker accuracy with linear complexity on rationale-based retrieval.

  27. Efficient Rationale-based Retrieval: On-policy Distillation from Generative Rerankers based on JEPA

    cs.IR 2026-04 unverdicted novelty 6.0

    Rabtriever distills a generative reranker into an efficient independent encoder using JEPA and auxiliary reverse KL loss to achieve linear complexity and strong performance on rationale-based retrieval tasks.

  28. Long-Horizon Manipulation via Trace-Conditioned VLA Planning

    cs.RO 2026-04 unverdicted novelty 6.0

    LoHo-Manip enables robust long-horizon robot manipulation by using a receding-horizon VLM manager to output progress-aware subtask sequences and 2D visual traces that condition a VLA executor for automatic replanning.

  29. Navigating the Clutter: Waypoint-Based Bi-Level Planning for Multi-Robot Systems

    cs.RO 2026-04 unverdicted novelty 6.0

    Waypoint-based bi-level planning with curriculum RLVR improves multi-robot task success rates in dense-obstacle benchmarks over motion-agnostic and VLA baselines.

  30. HELM: Harness-Enhanced Long-horizon Memory for Vision-Language-Action Manipulation

    cs.LG 2026-04 unverdicted novelty 6.0

    HELM raises long-horizon VLA success from 58.4% to 81.5% on LIBERO-LONG by combining episodic memory retrieval, learned failure prediction, and replanning, outperforming context extension or adaptation alone.

  31. From Perception to Planning: Evolving Ego-Centric Task-Oriented Spatiotemporal Reasoning via Curriculum Learning

    cs.AI 2026-04 unverdicted novelty 6.0

    EgoTSR applies a three-stage curriculum on a 46-million-sample dataset to build egocentric spatiotemporal reasoning, reaching 92.4% accuracy on long-horizon tasks and reducing chronological biases.

  32. Governed Capability Evolution: Lifecycle-Time Compatibility Checking and Rollback for AI-Component-Based Systems, with Embodied Agents as Case Study

    cs.RO 2026-04 unverdicted novelty 6.0

    A governed capability evolution framework for embodied agents uses four compatibility checks and a staged pipeline to achieve zero unsafe activations during upgrades while retaining comparable task success rates.

  33. An Edge-Host-Cloud Architecture for Robot-Agnostic, Caregiver-in-the-Loop Personalized Cognitive Exercise: Multi-Site Deployment in Dementia Care

    cs.RO 2026-04 unverdicted novelty 6.0

    Speaking Memories is a robot-agnostic edge-host-cloud architecture for caregiver-in-the-loop personalized cognitive exercise in dementia care, achieving sub-6-second latency and positive stakeholder feedback in multi-...

  34. Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success

    cs.RO 2025-02 accept novelty 6.0

    OpenVLA-OFT fine-tuning boosts LIBERO success rate from 76.5% to 97.1%, speeds action generation 26x, and outperforms baselines on real bimanual dexterous tasks.

  35. Unleashing Large-Scale Video Generative Pre-training for Visual Robot Manipulation

    cs.RO 2023-12 conditional novelty 6.0

    A GPT-style model pre-trained on large video datasets achieves 94.9% success on CALVIN multi-task manipulation and 85.4% zero-shot generalization, outperforming prior baselines.

  36. CAMEL: Communicative Agents for "Mind" Exploration of Large Language Model Society

    cs.AI 2023-03 conditional novelty 6.0

    CAMEL proposes a role-playing framework with inception prompting that enables autonomous multi-agent cooperation among LLMs and generates conversational data for studying their behaviors.

  37. Large Language Models for Sequential Decision-Making: Improving In-Context Learning via Supervised Fine-Tuning

    cs.LG 2026-05 unverdicted novelty 5.0

    Supervised fine-tuning of pretrained LLMs on offline trajectories yields better few-shot sequential decision-making than in-context-only baselines, with a theoretical suboptimality bound derived for linear MDPs by int...

  38. Evaluating Generative Models as Interactive Emergent Representations of Human-Like Collaborative Behavior

    cs.RO 2026-05 unverdicted novelty 5.0

    Embodied LLM agents exhibit emergent collaborative behaviors indicating mental models of partners in a color-matching game, detected via LLM judges and supported by positive user feedback.

  39. Evaluating Generative Models as Interactive Emergent Representations of Human-Like Collaborative Behavior

    cs.RO 2026-05 unverdicted novelty 5.0

    LLM agents in a collaborative 2D game exhibit emergent behaviors such as perspective-taking, theory of mind, and clarification, detected by LLM judges and rated positively by human participants.

  40. Pre-Execution Safety Gate & Task Safety Contracts for LLM-Controlled Robot Systems

    cs.RO 2026-04 unverdicted novelty 5.0

    SafeGate adds a deterministic pre-execution gate and runtime contracts with Z3 SMT solving to block unsafe LLM commands for robots while passing safe ones.

  41. ABot-Claw: A Foundation for Persistent, Cooperative, and Self-Evolving Robotic Agents

    cs.CV 2026-04 unverdicted novelty 4.0

    ABot-Claw is an embodied software layer that adds unified robot scheduling, cross-embodiment visual memory, and critic-driven replanning on top of OpenClaw to support persistent multi-robot execution from natural-lang...

  42. Large Language Models: A Survey

    cs.CL 2024-02 accept novelty 3.0

    The paper surveys key large language models, their training methods, datasets, evaluation benchmarks, and future research directions in the field.

Reference graph

Works this paper leans on

190 extracted references · 190 canonical work pages · cited by 38 Pith papers · 13 internal anchors

  1. [1]

    L. P . Kaelbling and T. Lozano-P´erez. Integrated task and motion planning in belief space. The International Journal of Robotics Research, 32(9-10):1194–1227, 2013

  2. [2]

    A. G. Barto and S. Mahadevan. Recent advances in hierarchical reinforcement learning.Discrete event dynamic systems, 13(1):41–77, 2003

  3. [3]

    H., and Riedel, S

    F. Petroni, T. Rockt¨aschel, P . Lewis, A. Bakhtin, Y . Wu, A. H. Miller, and S. Riedel. Language models as knowledge bases?arXiv preprint arXiv:1909.01066, 2019

  4. [4]

    Jiang, F

    Z. Jiang, F. F. Xu, J. Araki, and G. Neubig. How can we know what language models know? Transactions of the Association for Computational Linguistics, 8:423–438, 2020

  5. [5]

    Davison, J

    J. Davison, J. Feldman, and A. M. Rush. Commonsense knowledge mining from pretrained models. In Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th in- ternational joint conference on natural language processing (EMNLP-IJCNLP), pages 1173–1178, 2019

  6. [6]

    Talmor, Y

    A. Talmor, Y . Elazar, Y . Goldberg, and J. Berant. olmpics-on what language model pre-training captures. Transactions of the Association for Computational Linguistics, 8:743–758, 2020

  7. [7]

    How much knowledge can you pack into the parameters of a language model? arXiv preprint arXiv:2002.08910, 2020

    A. Roberts, C. Raffel, and N. Shazeer. How much knowledge can you pack into the parameters of a language model? arXiv preprint arXiv:2002.08910, 2020

  8. [8]

    PaLM: Scaling Language Modeling with Pathways

    A. Chowdhery, S. Narang, J. Devlin, M. Bosma, G. Mishra, A. Roberts, P . Barham, H. W. Chung, C. Sutton, S. Gehrmann, et al. Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022

  9. [9]

    Brown, B

    T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P . Dhariwal, A. Neelakantan, P . Shyam, G. Sastry, A. Askell, et al. Language models are few-shot learners.Advances in neural information processing systems, 33:1877–1901, 2020

  10. [10]

    J. Wei, X. Wang, D. Schuurmans, M. Bosma, E. Chi, Q. Le, and D. Zhou. Chain of thought prompting elicits reasoning in large language models.arXiv preprint arXiv:2201.11903, 2022

  11. [11]

    Large Language Models are Zero-Shot Reasoners

    T. Kojima, S. S. Gu, M. Reid, Y . Matsuo, and Y . Iwasawa. Large language models are zero-shot reasoners. arXiv preprint arXiv:2205.11916, 2022

  12. [12]

    A. K. Lampinen, I. Dasgupta, S. C. Chan, K. Matthewson, M. H. Tessler, A. Creswell, J. L. McClelland, J. X. Wang, and F. Hill. Can language models learn from explanations in context?arXiv preprint arXiv:2204.02329, 2022

  13. [13]

    M. Nye, A. J. Andreassen, G. Gur-Ari, H. Michalewski, J. Austin, D. Bieber, D. Dohan, A. Lewkowycz, M. Bosma, D. Luan, et al. Show your work: Scratchpads for intermediate computation with language models. arXiv preprint arXiv:2112.00114, 2021

  14. [14]

    L. S. Vygotsky. Thought and language. MIT press, 2012

  15. [15]

    Carruthers

    P . Carruthers. Thinking in language?: evolution and a modularist possibility. Cambridge University Press, 1998

  16. [16]

    Vygotsky

    L. Vygotsky. Tool and symbol in child development.The vygotsky reader, 1994

  17. [17]

    L. S. Vygotsky. Play and its role in the mental development of the child.Soviet psychology, 5(3): 6–18, 1967

  18. [18]

    Colas, T

    C. Colas, T. Karch, C. Moulin-Frier, and P .-Y . Oudeyer. Vygotskian autotelic artificial intelligence: Language and culture internalization for human-like ai.arXiv preprint arXiv:2206.01134, 2022

  19. [19]

    A. Zeng, A. Wong, S. Welker, K. Choromanski, F. Tombari, A. Purohit, M. Ryoo, V . Sindhwani, J. Lee, V . V anhoucke, et al. Socratic models: Composing zero-shot multimodal reasoning with language.arXiv preprint arXiv:2204.00598, 2022

  20. [20]

    Huang, P

    W . Huang, P . Abbeel, D. Pathak, and I. Mordatch. Language models as zero-shot planners: Extracting actionable knowledge for embodied agents. InInternational Conference on Machine Learning. PMLR, 2022. 10

  21. [21]

    M. Ahn, A. Brohan, N. Brown, Y . Chebotar, O. Cortes, B. David, C. Finn, K. Gopalakrishnan, K. Hausman, A. Herzog, D. Ho, J. Hsu, J. Ibarz, B. Ichter, A. Irpan, E. Jang, R. J. Ruano, K. Jeffrey, S. Jesmonth, N. Joshi, R. Julian, D. Kalashnikov, Y . Kuang, K.-H. Lee, S. Levine, Y . Lu, L. Luu, C. Parada, P . Pastor, J. Quiambao, K. Rao, J. Rettinghouse, D....

  22. [22]

    L. P . Kaelbling and T. Lozano-P´erez. Hierarchical planning in the now. In W orkshops at the T wenty-F ourth AAAI Conference on Artificial Intelligence, 2010

  23. [23]

    Srivastava, E

    S. Srivastava, E. Fang, L. Riano, R. Chitnis, S. Russell, and P . Abbeel. Combined task and motion planning through an extensible planner-independent interface layer. In2014 IEEE international conference on robotics and automation (ICRA), 2014

  24. [24]

    R. E. Fikes and N. J. Nilsson. Strips: A new approach to the application of theorem proving to problem solving. Artificial intelligence, 1971

  25. [25]

    E. D. Sacerdoti. A structure for plans and behavior. Technical report, SRI International, Menlo Park California Artificial Intelligence Center, 1975

  26. [26]

    D. Nau, Y . Cao, A. Lotem, and H. Munoz-Avila. Shop: Simple hierarchical ordered planner. In Proceedings of the 16th international joint conference on Artificial intelligence, 1999

  27. [27]

    S. M. LaV alle.Planning algorithms. Cambridge university press, 2006

  28. [28]

    Toussaint

    M. Toussaint. Logic-geometric programming: An optimization-based approach to combined task and motion planning. InT wenty-F ourth International Joint Conference on Artificial Intelligence, 2015

  29. [29]

    M. A. Toussaint, K. R. Allen, K. A. Smith, and J. B. Tenenbaum. Differentiable physics and stable modes for tool-use and manipulation planning.Robotics: Science and Systems F oundation, 2018

  30. [30]

    Eysenbach, R

    B. Eysenbach, R. R. Salakhutdinov, and S. Levine. Search on the replay buffer: Bridging planning and reinforcement learning.Advances in Neural Information Processing Systems, 2019

  31. [31]

    D. Xu, S. Nair, Y. Zhu, J. Gao, A. Garg, L. Fei-Fei, and S. Savarese. Neural task programming: Learning to generalize across hierarchical tasks. In 2018 IEEE International Conference on Robotics and Automation (ICRA), 2018

  32. [32]

    D. Xu, R. Martín-Martín, D.-A. Huang, Y. Zhu, S. Savarese, and L. Fei-Fei. Regression planning networks. Advances in Neural Information Processing Systems, 32, 2019

  33. [33]

    T. Silver, R. Chitnis, N. Kumar, W. McClinton, T. Lozano-Perez, L. P. Kaelbling, and J. Tenenbaum. Inventing relational state and action abstractions for effective and efficient bilevel planning. arXiv preprint arXiv:2203.09634, 2022

  34. [34]

    D. Shah, P. Xu, Y. Lu, T. Xiao, A. Toshev, S. Levine, and B. Ichter. Value function spaces: Skill-centric state abstractions for long-horizon reasoning. ICLR, 2022. URL https://openreview.net/pdf?id=vgqS1vkkCbE

  35. [35]

    A. Srinivas, A. Jabri, P. Abbeel, S. Levine, and C. Finn. Universal planning networks: Learning generalizable representations for visuomotor control. In International Conference on Machine Learning, pages 4732–4741. PMLR, 2018

  36. [36]

    T. Kurutach, A. Tamar, G. Yang, S. J. Russell, and P. Abbeel. Learning plannable representations with causal infogan. Advances in Neural Information Processing Systems, 31, 2018

  37. [37]

    A. Akakzia, C. Colas, P.-Y. Oudeyer, M. Chetouani, and O. Sigaud. Grounding language to autonomously-acquired skills via goal generation. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=chPjI5KMHG

  38. [38]

    S. Pirk, K. Hausman, A. Toshev, and M. Khansari. Modeling long-horizon tasks as sequential interaction landscapes. arXiv preprint arXiv:2006.04843, 2020

  39. [39]

    T. Kollar, S. Tellex, D. Roy, and N. Roy. Toward understanding natural language directions. In 2010 5th ACM/IEEE International Conference on Human-Robot Interaction (HRI), pages 259–266. IEEE, 2010

  40. [40]

    S. Tellex, T. Kollar, S. Dickerson, M. Walter, A. Banerjee, S. Teller, and N. Roy. Understanding natural language commands for robotic navigation and mobile manipulation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 25, pages 1507–1514, 2011

  41. [41]

    M. Bollini, S. Tellex, T. Thompson, N. Roy, and D. Rus. Interpreting and executing recipes with a cooking robot. In Experimental Robotics, pages 481–495. Springer, 2013

  42. [42]

    S. Tellex, R. Knepper, A. Li, D. Rus, and N. Roy. Asking for help using inverse semantics. 2014

  43. [43]

    T. Kollar, S. Tellex, D. Roy, and N. Roy. Grounding verbs of motion in natural language commands to robots. In Experimental Robotics, pages 31–47. Springer, 2014

  44. [44]

    V. Blukis, Y. Terme, E. Niklasson, R. A. Knepper, and Y. Artzi. Learning to map natural language instructions to physical quadcopter control using simulated flight. arXiv preprint arXiv:1910.09664, 2019

  45. [45]

    S. Nair and C. Finn. Hierarchical foresight: Self-supervised learning of long-horizon tasks via visual subgoal generation. arXiv preprint arXiv:1909.05829, 2020

  46. [46]

    F. Xia, C. Li, R. Martín-Martín, O. Litany, A. Toshev, and S. Savarese. Relmogen: Integrating motion generation in reinforcement learning for mobile manipulation. In 2021 IEEE International Conference on Robotics and Automation (ICRA), 2021

  47. [47]

    C. Li, F. Xia, R. Martin-Martin, and S. Savarese. Hrl4in: Hierarchical reinforcement learning for interactive navigation with mobile manipulators. In Conference on Robot Learning, 2020

  48. [48]

    Y. Jiang, S. Gu, K. Murphy, and C. Finn. Language as an abstraction for hierarchical deep reinforcement learning. In NeurIPS, 2019

  49. [49]

    D. Hafner, K.-H. Lee, I. Fischer, and P. Abbeel. Deep hierarchical planning from pixels. arXiv preprint arXiv:2206.04114, 2022

  50. [50]

    S. Mirchandani, S. Karamcheti, and D. Sadigh. Ella: Exploration through learned language abstraction. Advances in Neural Information Processing Systems, 34:29529–29540, 2021

  51. [51]

    P. A. Jansen. Visually-grounded planning without vision: Language models infer detailed plans from high-level instructions. arXiv preprint arXiv:2009.14259, 2020

  52. [52]

    P. Sharma, A. Torralba, and J. Andreas. Skill induction and planning with latent language. arXiv preprint arXiv:2110.01517, 2021

  53. [53]

    S. Li, X. Puig, Y. Du, C. Wang, E. Akyurek, A. Torralba, J. Andreas, and I. Mordatch. Pre-trained language models for interactive decision-making. arXiv preprint arXiv:2202.01771, 2022

  54. [54]

    M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. d. O. Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021

  55. [55]

    Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692, 2019

  56. [56]

    N. Reimers and I. Gurevych. Sentence-bert: Sentence embeddings using siamese bert-networks. arXiv preprint arXiv:1908.10084, 2019

  57. [57]

    J. Wei, M. Bosma, V. Y. Zhao, K. Guu, A. W. Yu, B. Lester, N. Du, A. M. Dai, and Q. V. Le. Finetuned language models are zero-shot learners. arXiv preprint arXiv:2109.01652, 2021

  58. [58]

    C. Paxton, Y. Bisk, J. Thomason, A. Byravan, and D. Fox. Prospection: Interpretable plans from language by predicting the future. In 2019 International Conference on Robotics and Automation (ICRA), pages 6942–6948. IEEE, 2019

  59. [59]

    S. Stepputtis, J. Campbell, M. Phielipp, S. Lee, C. Baral, and H. Ben Amor. Language-conditioned imitation learning for robot manipulation tasks. Advances in Neural Information Processing Systems, 33:13139–13150, 2020

  60. [60]

    V. Blukis, R. A. Knepper, and Y. Artzi. Few-shot object grounding and mapping for natural language robot instruction following. arXiv preprint arXiv:2011.07384, 2020

  61. [61]

    C. Lynch and P. Sermanet. Language conditioned imitation learning over unstructured data. Robotics: Science and Systems, 2021. URL https://arxiv.org/abs/2005.07648

  62. [62]

    Y. Chen, R. Xu, Y. Lin, and P. A. Vela. A joint network for grasp detection conditioned on natural language commands. In 2021 IEEE International Conference on Robotics and Automation (ICRA), pages 4576–4582. IEEE, 2021

  63. [63]

    O. Mees, L. Hermann, and W. Burgard. What matters in language conditioned robotic imitation learning. arXiv preprint arXiv:2204.06252, 2022

  64. [64]

    C. Yan, F. Carnevale, P. Georgiev, A. Santoro, A. Guy, A. Muldal, C.-C. Hung, J. Abramson, T. Lillicrap, and G. Wayne. Intra-agent speech permits zero-shot task acquisition. arXiv preprint arXiv:2206.03139, 2022

  65. [65]

    A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021

  66. [66]

    J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018

  67. [67]

    J. Lu, D. Batra, D. Parikh, and S. Lee. Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Advances in Neural Information Processing Systems, 32, 2019

  68. [68]

    Z. Wang, J. Yu, A. W. Yu, Z. Dai, Y. Tsvetkov, and Y. Cao. Simvlm: Simple visual language model pretraining with weak supervision. arXiv preprint arXiv:2108.10904, 2021

  69. [69]

    A. Suglia, Q. Gao, J. Thomason, G. Thattai, and G. Sukhatme. Embodied bert: A transformer model for embodied, language-guided visual task completion. arXiv preprint arXiv:2108.04927, 2021

  70. [70]

    T. Chen, S. Kornblith, K. Swersky, M. Norouzi, and G. E. Hinton. Big self-supervised models are strong semi-supervised learners. Advances in Neural Information Processing Systems, 33:22243–22255, 2020

  71. [71]

    A. Jain, M. Guo, K. Srinivasan, T. Chen, S. Kudugunta, C. Jia, Y. Yang, and J. Baldridge. Mural: multimodal, multitask retrieval across languages. arXiv preprint arXiv:2109.05125, 2021

  72. [72]

    J. Sun, D.-A. Huang, B. Lu, Y.-H. Liu, B. Zhou, and A. Garg. Plate: Visually-grounded planning with transformers in procedural tasks. IEEE Robotics and Automation Letters, 7(2):4924–4930, 2022

  73. [73]

    F. Sener and A. Yao. Zero-shot anticipation for instructional activities. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 862–871, 2019

  74. [74]

    A. Khandelwal, L. Weihs, R. Mottaghi, and A. Kembhavi. Simple but effective: Clip embeddings for embodied ai. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14829–14838, 2022

  75. [75]

    A. Zeng, P. Florence, J. Tompson, S. Welker, J. Chien, M. Attarian, T. Armstrong, I. Krasin, D. Duong, V. Sindhwani, and J. Lee. Transporter networks: Rearranging the visual world for robotic manipulation. Conference on Robot Learning (CoRL), 2020

  76. [76]

    M. Shridhar, L. Manuelli, and D. Fox. Cliport: What and where pathways for robotic manipulation. In Conference on Robot Learning, pages 894–906. PMLR, 2022

  77. [77]

    X. Gu, T.-Y. Lin, W. Kuo, and Y. Cui. Open-vocabulary object detection via vision and language knowledge distillation. arXiv preprint arXiv:2104.13921, 2021

  78. [78]

    I. Lenz, H. Lee, and A. Saxena. Deep learning for detecting robotic grasps.The International Journal of Robotics Research, 34(4-5):705–724, 2015

  79. [79]

    F.-J. Chu, R. Xu, and P. A. Vela. Real-world multiobject, multigrasp detection. IEEE Robotics and Automation Letters, 3(4):3355–3362, 2018

  80. [80]

    D. Kalashnikov, J. Varley, Y. Chebotar, B. Swanson, R. Jonschkowski, C. Finn, S. Levine, and K. Hausman. Mt-opt: Continuous multi-task robotic reinforcement learning at scale. arXiv preprint arXiv:2104.08212, 2021

Showing first 80 references.