Do As I Can, Not As I Say: Grounding Language in Robotic Affordances
Pith reviewed 2026-05-10 22:19 UTC · model grok-4.3
The pith
Large language models can direct robots through complex real-world tasks when their proposals are constrained by pretrained skills and value functions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that real-world grounding via pretrained skills is essential for leveraging language models in robotics. The language model supplies high-level semantic knowledge about task procedures, while the skills and their value functions constrain proposals to actions that are both feasible for the robot's embodiment and appropriate to the current physical context. This division lets the robot act as the model's hands and eyes, enabling execution of long-horizon, abstract natural language instructions on a mobile manipulator.
What carries the argument
Pretrained skills with associated value functions that filter and ground the language model's action proposals to the robot's actual affordances and environment.
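As a concrete illustration, this mechanism can be sketched as a greedy loop that scores each pretrained skill by the product of the language model's likelihood for that skill's description and the skill's value in the current state. A minimal sketch, assuming stand-in callables `llm_score` and `value_fn` — these names and interfaces are illustrative, not the authors' actual code:

```python
# Hypothetical sketch of SayCan-style grounded action selection.
# `llm_score` and `value_fn` are stand-in callables, not the paper's APIs.

def select_action(instruction, history, skills, llm_score, value_fn, state):
    """Pick the skill maximizing LLM likelihood times affordance value."""
    best_skill, best_score = None, float("-inf")
    for skill in skills:
        p_say = llm_score(instruction, history, skill)  # semantic plausibility
        p_can = value_fn(state, skill)                  # feasibility in this state
        score = p_say * p_can
        if score > best_score:
            best_skill, best_score = skill, score
    return best_skill

def run(instruction, skills, llm_score, value_fn, get_state, execute, max_steps=20):
    """Greedy closed loop: select, execute, append to history, stop on 'done'."""
    history = []
    for _ in range(max_steps):
        skill = select_action(instruction, history, skills,
                              llm_score, value_fn, get_state())
        if skill == "done":
            break
        execute(skill)
        history.append(skill)
    return history
```

Note how a semantically attractive but infeasible proposal (high LLM likelihood, near-zero value) loses to a feasible alternative — this is the grounding the paper argues for.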
If this is right
- Abstract and temporally extended natural language instructions become executable on physical robots without task-specific retraining.
- The need for embodiment-specific grounding beyond language alone is demonstrated through failures of ungrounded models.
- High-level semantic knowledge from language models can be translated into context-appropriate actions via feasibility checks from skills.
- A mobile manipulator can reliably carry out complex real-world commands that combine planning and execution.
Where Pith is reading between the lines
- Expanding the skill library could allow the same language model to handle a wider variety of tasks and environments.
- The separation of high-level semantic planning from low-level grounding may apply to other AI systems that require physical interaction.
- Robustness could be tested by deploying the system in new settings and measuring how often value functions require recalibration.
Load-bearing premise
The collection of pretrained skills must be complete for the needed tasks and their value functions must correctly reflect success probabilities in the specific target environment.
What would settle it
A natural language instruction that the combined system cannot complete because every high-probability proposal from the language model either falls outside the skill set or leads to repeated execution failures despite high predicted value.
Original abstract
Large language models can encode a wealth of semantic knowledge about the world. Such knowledge could be extremely useful to robots aiming to act upon high-level, temporally extended instructions expressed in natural language. However, a significant weakness of language models is that they lack real-world experience, which makes it difficult to leverage them for decision making within a given embodiment. For example, asking a language model to describe how to clean a spill might result in a reasonable narrative, but it may not be applicable to a particular agent, such as a robot, that needs to perform this task in a particular environment. We propose to provide real-world grounding by means of pretrained skills, which are used to constrain the model to propose natural language actions that are both feasible and contextually appropriate. The robot can act as the language model's "hands and eyes," while the language model supplies high-level semantic knowledge about the task. We show how low-level skills can be combined with large language models so that the language model provides high-level knowledge about the procedures for performing complex and temporally-extended instructions, while value functions associated with these skills provide the grounding necessary to connect this knowledge to a particular physical environment. We evaluate our method on a number of real-world robotic tasks, where we show the need for real-world grounding and that this approach is capable of completing long-horizon, abstract, natural language instructions on a mobile manipulator. The project's website and the video can be found at https://say-can.github.io/.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces SayCan, a framework that uses an off-the-shelf large language model to generate candidate natural-language action sequences for long-horizon robotic tasks while grounding those proposals via a library of pretrained skills and their associated value functions; the value functions score the feasibility of each LLM-proposed step in the current physical environment, enabling the robot to execute abstract instructions on a mobile manipulator with higher success rates than LLM-only or skill-only baselines.
Significance. If the central result holds, the work demonstrates a concrete, deployable way to combine the semantic knowledge encoded in LLMs with embodiment-specific affordances, yielding measurable gains on real-robot multi-step tasks. The real-world experiments and the explicit separation of high-level planning from low-level feasibility scoring are strengths that could influence subsequent research on grounded language models for robotics.
major comments (3)
- [§4 and §5] §4 (Method) and §5 (Experiments): the evaluation relies on a hand-curated skill library whose coverage matches the test instructions; no ablation removes skills, adds novel skills, or evaluates tasks that require skills outside the library, so it remains unclear whether reported success rates are attributable to the grounding mechanism or to exhaustive pre-coverage of the test distribution.
- [§3.2 and §5.1] §3.2 (Value Functions) and §5.1: the manuscript gives limited detail on how the skill value functions were trained (data collection protocol, network architecture, training objective, and whether training occurred in the identical environment used at test time); without this information it is difficult to evaluate the claim that the value functions provide reliable grounding without environment-specific recalibration.
- [§5.2] §5.2 (Real-robot results): the reported trials do not include controlled perturbations of the scene (object relocation, lighting change, or minor robot morphology variation) to test whether the pretrained value functions continue to rank actions correctly; such a test is load-bearing for the assertion that the method supplies robust real-world grounding.
minor comments (2)
- [Figure 2] Figure 2 and the accompanying text use the term “value function” without an explicit equation or pseudocode definition of how the LLM likelihood is combined with the skill value; adding a short formal expression would improve clarity.
- [Abstract and §1] The abstract and introduction repeatedly state that the approach “shows the need for real-world grounding,” yet the quantitative comparison is only against LLM-only and skill-only baselines; a brief discussion of why alternative grounding methods (e.g., learned affordance models) were not included would strengthen the narrative.
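For reference, the combination the second minor comment asks to see spelled out can be written as follows. The notation here is our reconstruction, not quoted from the paper: $i$ is the instruction, $\Pi$ the skill library, $\ell_\pi$ the language description of skill $\pi$, $s$ the current state, and $c_\pi$ the event that skill $\pi$ succeeds.

```latex
% i: natural-language instruction; \Pi: skill library;
% \ell_\pi: language description of skill \pi;
% s: current state; c_\pi: event that skill \pi succeeds.
\pi^{*} = \arg\max_{\pi \in \Pi}\;
  \underbrace{p\!\left(\ell_\pi \mid i,\, \ell_{\pi_1}, \dots, \ell_{\pi_{t-1}}\right)}_{\text{LLM likelihood (``say'')}}
  \;\cdot\;
  \underbrace{p\!\left(c_\pi \mid s,\, \ell_\pi\right)}_{\text{value function (``can'')}}
```

The selected skill's description is appended to the prompt history and the loop repeats until a termination skill is chosen.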
Simulated Author's Rebuttal
We thank the referee for the thoughtful review and the recommendation for minor revision. The comments highlight important aspects of our evaluation and method that we will clarify in the revised manuscript. Below we respond to each major comment.
Point-by-point responses
-
Referee: [§4 and §5] §4 (Method) and §5 (Experiments): the evaluation relies on a hand-curated skill library whose coverage matches the test instructions; no ablation removes skills, adds novel skills, or evaluates tasks that require skills outside the library, so it remains unclear whether reported success rates are attributable to the grounding mechanism or to exhaustive pre-coverage of the test distribution.
Authors: We agree that our skill library is curated to match the capabilities needed for the evaluated tasks, which is a common setup for demonstrating grounding in robotics. The key contribution is showing that the LLM can effectively select among these skills using the value functions for feasibility. We do include baselines that use the same skill library without the LLM grounding (e.g., random or scripted selection), which perform worse, suggesting the grounding mechanism adds value beyond just having the skills available. However, we did not evaluate on tasks requiring skills outside the library, as that would require learning new skills, which is outside the scope of this work. In the revision, we will add a paragraph in the discussion section clarifying the scope of the evaluation and noting that handling novel skills is an exciting direction for future research. revision: partial
-
Referee: [§3.2 and §5.1] §3.2 (Value Functions) and §5.1: the manuscript gives limited detail on how the skill value functions were trained (data collection protocol, network architecture, training objective, and whether training occurred in the identical environment used at test time); without this information it is difficult to evaluate the claim that the value functions provide reliable grounding without environment-specific recalibration.
Authors: Thank you for pointing this out; we will provide more details in the revised §3.2. The value functions were trained using a combination of teleoperated demonstrations and self-supervised rollouts collected in the same physical environment as the test tasks, but with randomized initial conditions to encourage generalization. The network is a multimodal transformer that takes RGB images and language skill descriptions as input and outputs a success probability. It was trained with a binary cross-entropy loss on labeled success/failure outcomes. Importantly, once trained, the value functions are used without further fine-tuning or recalibration for the specific instructions in our experiments. We will include these details along with references to the training code and hyperparameters in the appendix. revision: yes
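The training objective described in this response can be sketched in miniature: a toy logistic model stands in for the multimodal transformer, and binary cross-entropy is computed on labeled success/failure outcomes. All names below are illustrative assumptions, not the authors' code.

```python
# Toy sketch of the value-function training objective from the rebuttal:
# a logistic model (stand-in for the multimodal transformer) predicts
# success probability, scored with binary cross-entropy.
import math

def value_prediction(weights, features):
    """Illustrative stand-in for the value network: logistic over features."""
    z = sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))  # predicted success probability

def bce_loss(p_success, label):
    """Binary cross-entropy on a labeled success (1) / failure (0) outcome."""
    eps = 1e-12  # guard against log(0)
    return -(label * math.log(p_success + eps)
             + (1 - label) * math.log(1 - p_success + eps))
```

Minimizing this loss over demonstration and rollout outcomes is what calibrates the predicted success probabilities that the grounding step relies on.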
-
Referee: [§5.2] §5.2 (Real-robot results): the reported trials do not include controlled perturbations of the scene (object relocation, lighting change, or minor robot morphology variation) to test whether the pretrained value functions continue to rank actions correctly; such a test is load-bearing for the assertion that the method supplies robust real-world grounding.
Authors: We recognize that our experiments did not include explicit controlled perturbations beyond the natural variations present in the real-world trials (e.g., slight differences in object placement across runs). While this limits the strength of claims about robustness to arbitrary changes, the tasks were performed in a real kitchen environment with some inherent variability, and the value functions were trained to handle such variations. We will revise the manuscript to include a more explicit discussion of this limitation in §5.2 and the conclusion, emphasizing that while our results demonstrate effective grounding in the tested conditions, further stress-testing under perturbations remains important future work. revision: partial
Circularity Check
No significant circularity; derivation relies on independent pretrained components
Full rationale
The paper's central mechanism scores LLM-proposed actions by their language-model likelihood multiplied by the value function of a matching pretrained skill. This selection rule and the reported task success rates are not defined in terms of the target results themselves, nor do any equations reduce the claimed long-horizon performance to a fitted parameter or self-referential construction. The value functions and skill library are treated as fixed external inputs trained separately; the contribution of the paper is the combination rule and its empirical demonstration on real robots. No load-bearing step invokes a self-citation chain whose validity is assumed without external verification, and no ansatz or uniqueness theorem is smuggled in to force the architecture.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Pretrained skills exist and can be composed via value-function scoring without additional learning during deployment.
- domain assumption The language model's next-token distribution can be treated as a prior over feasible high-level plans once low-probability or infeasible tokens are masked by grounding.
Forward citations
Cited by 60 Pith papers
-
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
Chain-of-thought prompting, by including intermediate reasoning steps in few-shot examples, elicits strong reasoning abilities in large language models on arithmetic, commonsense, and symbolic tasks.
-
From Prompt to Physical Actuation: Holistic Threat Modeling of LLM-Enabled Robotic Systems
A unified threat model for LLM-enabled robots reveals three cross-boundary attack chains from user input to unsafe physical actuation due to missing validations and unmediated crossings.
-
State-Centric Decision Process
SDP constructs a task-induced state space from raw text by having agents commit to and certify natural-language predicates as states, enabling structured planning and analysis in unstructured language environments.
-
Think Twice, Act Once: Verifier-Guided Action Selection For Embodied Agents
VeGAS improves MLLM-based embodied agents by sampling action ensembles and using a verifier trained on LLM-synthesized failure cases, yielding up to 36% relative gains on hard multi-object long-horizon tasks in Habita...
-
BARISTA: A Multi-Task Egocentric Benchmark for Compositional Visual Understanding
BARISTA introduces a densely annotated egocentric coffee-preparation video dataset and multi-task benchmark that reveals performance variation across models on compositional visual tasks.
-
ECHO: Continuous Hierarchical Memory for Vision-Language-Action Models
ECHO organizes VLA experiences into a hierarchical memory tree in hyperbolic space via autoencoder and entailment constraints, delivering a 12.8% success-rate gain on LIBERO-Long over the pi0 baseline.
-
Effective Explanations Support Planning Under Uncertainty
Explanations scored higher by an LLM-plus-planner model are judged more helpful by people and produce measurably better navigation performance in uncertain environments than lower-scored or no explanations.
-
Action Agent: Agentic Video Generation Meets Flow-Constrained Diffusion
Action Agent pairs LLM-driven video generation with a flow-constrained diffusion transformer to produce velocity commands, raising video success to 86% and delivering 64.7% real-world navigation on a Unitree G1 humanoid.
-
Thinking in Text and Images: Interleaved Vision--Language Reasoning Traces for Long-Horizon Robot Manipulation
A multimodal transformer generates and caches interleaved text-image traces to guide closed-loop actions, achieving 92.4% success on LIBERO-Long and 95.5% average on LIBERO.
-
OmniRobotHome: A Multi-Camera Platform for Real-Time Multiadic Human-Robot Interaction
A 48-camera residential platform delivers real-time occlusion-robust 3D perception and coordinated actuation for multi-human multi-robot interaction in a shared home workspace.
-
Atomic-Probe Governance for Skill Updates in Compositional Robot Policies
A cross-version swap protocol reveals dominant skills that swing composition success by up to 50 percentage points, and an atomic probe with selective revalidation governs updates at lower cost than always re-testing ...
-
AeroBridge-TTA: Test-Time Adaptive Language-Conditioned Control for UAVs
AeroBridge-TTA achieves +22 pt average gains on out-of-distribution UAV dynamics mismatches by updating a latent state online from observed transitions in a language-conditioned policy.
-
Semantic Area Graph Reasoning for Multi-Robot Language-Guided Search
SAGR builds a semantic area graph from occupancy maps so LLMs can assign rooms to robots for language-guided search, staying competitive with standard exploration while improving semantic target finding by up to 18.8%...
-
ADAPT: Benchmarking Commonsense Planning under Unspecified Affordance Constraints
ADAPT augments planners with affordance reasoning to raise task success in environments with unspecified and time-varying object affordances, and a LoRA-finetuned VLM backend beats GPT-4o on the new DynAfford benchmark.
-
Self-Correcting RAG: Enhancing Faithfulness via MMKP Context Selection and NLI-Guided MCTS
Self-Correcting RAG formalizes retrieval as MMKP to maximize information density under token limits and uses NLI-guided MCTS to validate faithfulness, raising accuracy and cutting hallucinations on six multi-hop QA an...
-
Governed Capability Evolution: Lifecycle-Time Compatibility Checking and Rollback for AI-Component-Based Systems, with Embodied Agents as Case Study
A governed capability evolution framework with interface, policy, behavioral, and recovery checks reduces unsafe activations to zero in embodied agent upgrades while preserving task success rates.
-
Harnessing Embodied Agents: Runtime Governance for Policy-Constrained Execution
A runtime governance framework for embodied agents achieves 96.2% interception of unauthorized actions and 91.4% recovery success in 1000 simulation trials by externalizing policy enforcement.
-
GUIDE: Interpretable GUI Agent Evaluation via Hierarchical Diagnosis
GUIDE decomposes GUI agent evaluation into trajectory segmentation, subtask diagnosis, and overall summary to deliver higher accuracy and structured error reports than holistic baselines.
-
Evo-Memory: Benchmarking LLM Agent Test-time Learning with Self-Evolving Memory
Evo-Memory is a new benchmark for self-evolving memory in LLM agents across task streams, with baseline ExpRAG and proposed ReMem method that integrates reasoning, actions, and memory updates for continual improvement.
-
$\tau$-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains
τ-bench shows state-of-the-art agents like GPT-4o succeed on under 50% of tool-using, rule-following tasks and are inconsistent across repeated trials.
-
VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models
VoxPoser uses LLMs to compose 3D value maps via VLM interaction for model-based synthesis of robust robot trajectories on open-set language-specified manipulation tasks.
-
Voyager: An Open-Ended Embodied Agent with Large Language Models
Voyager achieves superior lifelong learning in Minecraft by combining an automatic exploration curriculum, a library of executable skills, and iterative LLM prompting with environment feedback, yielding 3.3x more uniq...
-
LLM+P: Empowering Large Language Models with Optimal Planning Proficiency
LLM+P lets LLMs solve planning problems optimally by converting them to PDDL for classical planners and back to natural language.
-
Reflexion: Language Agents with Verbal Reinforcement Learning
Reflexion lets LLM agents improve via stored verbal reflections on task feedback, reaching 91% pass@1 on HumanEval and outperforming prior GPT-4 results.
-
A Generalist Agent
Gato is a multi-modal, multi-task, multi-embodiment generalist policy using one transformer network to handle text, vision, games, and robotics tasks.
-
MMSkills: Towards Multimodal Skills for General Visual Agents
MMSkills turns public interaction trajectories into compact multimodal skill packages that visual agents can consult at runtime to improve decision-making on benchmarks.
-
Towards Long-horizon Embodied Agents with Tool-Aligned Vision-Language-Action Models
VLAs-as-Tools pairs a VLM planner with specialized VLA executors via a new interface and Tool-Aligned Post-Training to raise long-horizon robot success rates on LIBERO-Long and RoboTwin benchmarks.
-
When Should an AI Workflow Release? Always-Valid Inference for Black-Box Generate-Verify Systems
A wrapper for black-box generate-verify AI pipelines that uses a conservative hard-negative reference pool and e-processes to control the probability of releasing on infeasible tasks while permitting release on feasible ones.
-
GuidedVLA: Specifying Task-Relevant Factors via Plug-and-Play Action Attention Specialization
GuidedVLA improves VLA success rates by manually supervising separate attention heads in the action decoder with auxiliary signals for task-relevant factors.
-
PriorZero: Bridging Language Priors and World Models for Decision Making
PriorZero uses root-only LLM prior injection in MCTS and alternating world-model training with LLM fine-tuning to raise exploration efficiency and final performance on Jericho text games and BabyAI gridworlds.
-
Engagement Process: Rethinking the Temporal Interface of Action and Observation
Engagement Process decouples actions and observations into separate time-based event streams within a POMDP structure to explicitly model timing mismatches, deliberation latency, and multi-rate interactions.
-
Weighted Rules under the Stable Model Semantics
Weighted rules extend stable model semantics to support probabilistic reasoning, model ranking, and statistical inference in answer set programs.
-
Kintsugi: Learning Policies by Repairing Executable Knowledge Bases
Kintsugi learns policies by repairing composable executable knowledge bases through agentic diagnosis, localized typed edits, and deterministic verification gates that admit only improvements.
-
RePO-VLA: Recovery-Driven Policy Optimization for Vision-Language-Action Models
RePO-VLA raises average adversarial success rates in VLA manipulation from 20% to 75% by using recovery-aware initialization, a progress-aware semantic value function, and value-conditioned refinement on success and c...
-
Escaping the Diversity Trap in Robotic Manipulation via Anchor-Centric Adaptation
Anchor-Centric Adaptation escapes the diversity trap by prioritizing repeated demonstrations at core anchors over broad coverage, yielding higher success rates under fixed data budgets in robotic manipulation.
-
TriRelVLA: Triadic Relational Structure for Generalizable Embodied Manipulation
TriRelVLA introduces triadic object-hand-task relational representations and a task-grounded graph transformer with a relational bottleneck to improve generalization in robotic manipulation across scenes, objects, and tasks.
-
Creative Robot Tool Use by Counterfactual Reasoning
Robots discover causal tool features through VLM suggestions and physics-based counterfactual perturbations in simulation, then transfer manipulation skills via conditioned keypoint matching.
-
Information Coordination as a Bridge: A Neuro-Symbolic Architecture for Reliable Autonomous Driving Scene Understanding
InfoCoordiBridge coordinates multi-sensor perception outputs into a single conflict-aware SceneSummary before reasoning to improve consistency and reduce hallucinations in autonomous driving scene understanding.
-
Temporal Reasoning Is Not the Bottleneck: A Probabilistic Inconsistency Framework for Neuro-Symbolic QA
Temporal reasoning is not the core bottleneck for LLMs on time-based QA; the real issue is unstructured text-to-event mapping, addressed by a neuro-symbolic system with PIS that reaches 100% accuracy on benchmarks whe...
-
Stable Agentic Control: Tool-Mediated LLM Architecture for Autonomous Cyber Defense
Tool-mediated LLM agents with deterministic tools and a machine-checked Lyapunov certificate achieve stable control in cyber defense, reducing attacker game value by 59% on real attack graphs.
-
An Efficient Metric for Data Quality Measurement in Imitation Learning
Power spectral density of trajectories ranks demonstration quality for imitation learning, enabling rollout-free curation that improves fine-tuned policy success.
-
Affordance Agent Harness: Verification-Gated Skill Orchestration
Affordance Agent Harness is a verification-gated orchestration system that unifies skills via an evidence store, episodic memory priors, an adaptive router, and a self-consistency verifier to improve accuracy-cost tra...
-
Odysseus: Scaling VLMs to 100+ Turn Decision-Making in Games via Reinforcement Learning
Odysseus adapts PPO with a turn-level critic and leverages pretrained VLM action priors to train agents achieving at least 3x average game progress over frontier models in long-horizon Super Mario Land.
-
Atomic-Probe Governance for Skill Updates in Compositional Robot Policies
Empirical study on robosuite tasks reveals a dominant-skill effect in compositions and shows that an atomic probe approximates full revalidation for skill updates at much lower cost.
-
Learning from the Best: Smoothness-Driven Metrics for Data Quality in Imitation Learning
RINSE scores robot demonstration trajectories for smoothness via SAL and TED metrics to curate higher-quality data for behavioral cloning, improving success rates with less data on benchmarks and real robots.
-
Navigating the Clutter: Waypoint-Based Bi-Level Planning for Multi-Robot Systems
Waypoint-based bi-level planning with curriculum RLVR improves multi-robot task success rates in dense-obstacle benchmarks over motion-agnostic and VLA baselines.
-
SafetyALFRED: Evaluating Safety-Conscious Planning of Multimodal Large Language Models
SafetyALFRED shows multimodal LLMs recognize kitchen hazards accurately in QA tests but achieve low success rates when required to mitigate those hazards through embodied planning.
-
Back into Plato's Cave: Examining Cross-modal Representational Convergence at Scale
Evidence for cross-modal representational convergence weakens substantially at scale and in realistic many-to-many settings, indicating models learn rich but distinct representations.
-
ST-$\pi$: Structured SpatioTemporal VLA for Robotic Manipulation
ST-π structures VLA models by having a spatiotemporal VLM produce causally ordered chunk-level prompts that guide a dual-generator action expert to jointly handle spatial and temporal control in robotic manipulation.
-
XRZero-G0: Pushing the Frontier of Dexterous Robotic Manipulation with Interfaces, Quality and Ratios
XRZero-G0 enables 2000-hour robot-free datasets that, when mixed 10:1 with real-robot data, match full real-robot performance at 1/20th the cost and support zero-shot transfer.
-
EmbodiedGovBench: A Benchmark for Governance, Recovery, and Upgrade Safety in Embodied Agent Systems
EmbodiedGovBench is a new benchmark framework that measures embodied agent systems on seven governance dimensions including policy adherence, recovery success, and upgrade safety.
-
From Perception to Planning: Evolving Ego-Centric Task-Oriented Spatiotemporal Reasoning via Curriculum Learning
EgoTSR applies a three-stage curriculum on a 46-million-sample dataset to build egocentric spatiotemporal reasoning, reaching 92.4% accuracy on long-horizon tasks and reducing chronological biases.
-
PlanGuard: Defending Agents against Indirect Prompt Injection via Planning-based Consistency Verification
PlanGuard cuts indirect prompt injection attack success rate to 0% on the InjecAgent benchmark by verifying agent actions against a user-instruction-only plan while keeping false positives at 1.49%.
-
Governed Capability Evolution: Lifecycle-Time Compatibility Checking and Rollback for AI-Component-Based Systems, with Embodied Agents as Case Study
A governed capability evolution framework for embodied agents uses four compatibility checks and a staged pipeline to achieve zero unsafe activations during upgrades while retaining comparable task success rates.
-
RoboPlayground: Democratizing Robotic Evaluation through Structured Physical Domains
RoboPlayground reframes robotic manipulation evaluation as a language-driven process over structured physical domains, letting users author varied yet reproducible tasks that reveal policy generalization failures.
-
DIAL: Decoupling Intent and Action via Latent World Modeling for End-to-End VLA
DIAL decouples intent from action in end-to-end VLAs using a latent visual foresight bottleneck and two-stage training, reaching SOTA on RoboCasa with 10x fewer demonstrations and zero-shot real-world transfer.
-
SoK: Agentic Skills -- Beyond Tool Use in LLM Agents
The paper systematizes agentic skills beyond tool use, providing design pattern and representation-scope taxonomies plus security analysis of malicious skill infiltration in agent marketplaces.
-
Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success
OpenVLA-OFT fine-tuning boosts LIBERO success rate from 76.5% to 97.1%, speeds action generation 26x, and outperforms baselines on real bimanual dexterous tasks.
-
DexVLA: Vision-Language Model with Plug-In Diffusion Expert for General Robot Control
DexVLA combines a scaled diffusion action expert with embodiment curriculum learning to achieve better generalization and performance than prior VLA models on diverse robot hardware and long-horizon tasks.
-
FAST: Efficient Action Tokenization for Vision-Language-Action Models
FAST applies discrete cosine transform to robot action sequences for efficient tokenization, enabling autoregressive VLAs to succeed on high-frequency dexterous tasks and scale to 10k hours of data while matching diff...
Reference graph
Works this paper leans on
-
[1]
E. M. Bender and A. Koller. Climbing towards NLU: On meaning, form, and understanding in the age of data. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020
-
[2]
A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017
-
[3]
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018
- [4]
- [5]
-
[6]
J. W. Rae, S. Borgeaud, T. Cai, K. Millican, J. Hoffmann, F. Song, J. Aslanides, S. Henderson, R. Ring, S. Young, et al. Scaling language models: Methods, analysis & insights from training gopher. arXiv preprint arXiv:2112.11446, 2021
-
[7]
LaMDA: Language Models for Dialog Applications
R. Thoppilan, D. De Freitas, J. Hall, N. Shazeer, A. Kulshreshtha, H.-T. Cheng, A. Jin, T. Bos, L. Baker, Y. Du, et al. LaMDA: Language models for dialog applications. arXiv preprint arXiv:2201.08239, 2022
-
[8]
J. Wei, M. Bosma, V. Y. Zhao, K. Guu, A. W. Yu, B. Lester, N. Du, A. M. Dai, and Q. V. Le. Finetuned language models are zero-shot learners. arXiv preprint arXiv:2109.01652, 2021
-
[9]
A. Chowdhery, S. Narang, J. Devlin, et al. PaLM: Scaling language modeling with Pathways. 2022. URL https://storage.googleapis.com/pathways-language-model/PaLM-paper.pdf
-
[10]
J. J. Gibson. The theory of affordances. The Ecological Approach to Visual Perception, 1977
- [11]
- [12]
-
[13]
E. Jang, A. Irpan, M. Khansari, D. Kappler, F. Ebert, C. Lynch, S. Levine, and C. Finn. BC-Z: Zero-shot task generalization with robotic imitation learning. In Conference on Robot Learning, pages 991–1002. PMLR, 2021
-
[14]
D. Kalashnikov, J. Varley, Y. Chebotar, B. Swanson, R. Jonschkowski, C. Finn, S. Levine, and K. Hausman. MT-Opt: Continuous multi-task robotic reinforcement learning at scale. arXiv, 2021
-
[15]
D. Cer, Y. Yang, S.-y. Kong, N. Hua, N. Limtiaco, R. S. John, N. Constant, M. Guajardo-Cespedes, S. Yuan, C. Tar, et al. Universal sentence encoder. arXiv preprint arXiv:1803.11175, 2018
-
[16]
D. Ho, K. Rao, Z. Xu, E. Jang, M. Khansari, and Y. Bai. RetinaGAN: An object-aware approach to sim-to-real transfer. 2021 IEEE International Conference on Robotics and Automation (ICRA), pages 10920–10926, 2021
work page 2021
- [17] M. Shridhar, J. Thomason, D. Gordon, Y. Bisk, W. Han, R. Mottaghi, L. Zettlemoyer, and D. Fox. ALFRED: A benchmark for interpreting grounded instructions for everyday tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10740–10749, 2020.
- [18] S. Srivastava, C. Li, M. Lingelbach, R. Martín-Martín, F. Xia, K. E. Vainio, Z. Lian, C. Gokmen, S. Buch, K. Liu, et al. BEHAVIOR: Benchmark for everyday household activities in virtual, interactive, and ecological environments. In Conference on Robot Learning, pages 477–490. PMLR, 2022.
- [19] A. Hosseini, S. Reddy, D. Bahdanau, R. D. Hjelm, A. Sordoni, and A. Courville. Understanding by understanding not: Modeling negation in language models. arXiv preprint arXiv:2105.03519, 2021.
- [20] S. Stepputtis, J. Campbell, M. Phielipp, S. Lee, C. Baral, and H. B. Amor. Language-conditioned imitation learning for robot manipulation tasks. ArXiv, abs/2010.12083, 2020.
- [21] S. Nair, E. Mitchell, K. Chen, B. Ichter, S. Savarese, and C. Finn. Learning language-conditioned robot behavior from offline data and crowd-sourced annotation. In Conference on Robot Learning, pages 1303–1315. PMLR, 2021.
- [22]
- [23] W. Huang, P. Abbeel, D. Pathak, and I. Mordatch. Language models as zero-shot planners: Extracting actionable knowledge for embodied agents. arXiv preprint arXiv:2201.07207, 2022.
- [24] J. Wei, X. Wang, D. Schuurmans, M. Bosma, E. Chi, Q. Le, and D. Zhou. Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903, 2022.
- [25] W. Huang, F. Xia, T. Xiao, H. Chan, J. Liang, P. Florence, A. Zeng, J. Tompson, I. Mordatch, Y. Chebotar, et al. Inner Monologue: Embodied reasoning through planning with language models. arXiv preprint arXiv:2207.05608, 2022.
- [26] M. Shridhar, L. Manuelli, and D. Fox. CLIPort: What and where pathways for robotic manipulation. In Conference on Robot Learning, 2022.
- [27] X. Gu, T.-Y. Lin, W. Kuo, and Y. Cui. Open-vocabulary object detection via vision and language knowledge distillation. arXiv preprint arXiv:2104.13921, 2021.
- [28] J. M. Siskind. Grounding language in perception. Artificial Intelligence Review, 1994.
- [29]
- [30] C. Sun, A. Myers, C. Vondrick, K. Murphy, and C. Schmid. VideoBERT: A joint model for video and language representation learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019.
- [31]
- [32] J. Lu, D. Batra, D. Parikh, and S. Lee. ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Advances in Neural Information Processing Systems, 2019.
- [33] R. Zellers, X. Lu, J. Hessel, Y. Yu, J. S. Park, J. Cao, A. Farhadi, and Y. Choi. MERLOT: Multimodal neural script knowledge models. Advances in Neural Information Processing Systems, 2021.
- [34] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, 2021.
- [35]
- [36] A. Pashevich, C. Schmid, and C. Sun. Episodic transformer for vision-and-language navigation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021.
- [37]
- [38]
- [39]
- [40]
- [41] A. Akakzia, C. Colas, P.-Y. Oudeyer, M. Chetouani, and O. Sigaud. Grounding language to autonomously-acquired skills via goal generation. arXiv preprint arXiv:2006.07185, 2020.
- [42] R. Zellers, A. Holtzman, M. Peters, R. Mottaghi, A. Kembhavi, A. Farhadi, and Y. Choi. PIGLeT: Language grounding through neuro-symbolic interaction in a 3D world. arXiv preprint arXiv:2106.00188, 2021.
- [43]
- [44]
- [45]
- [46] M. MacMahon, B. Stankiewicz, and B. Kuipers. Walk the talk: Connecting language, knowledge, and action in route instructions. 2006.
- [47]
- [48]
- [49] J. Luketina, N. Nardelli, G. Farquhar, J. N. Foerster, J. Andreas, E. Grefenstette, S. Whiteson, and T. Rocktäschel. A survey of reinforcement learning informed by natural language. In IJCAI, 2019.
- [50]
- [51] H. Mei, M. Bansal, and M. R. Walter. Listen, attend, and walk: Neural mapping of navigational instructions to action sequences. In AAAI, 2016.
- [52] D. K. Misra, J. Langford, and Y. Artzi. Mapping instructions and visual observations to actions with reinforcement learning. In EMNLP, 2017.
- [53] K. Hermann, F. Hill, S. Green, F. Wang, R. Faulkner, H. Soyer, D. Szepesvari, W. Czarnecki, M. Jaderberg, D. Teplyashin, M. Wainwright, C. Apps, D. Hassabis, and P. Blunsom. Grounded language learning in a simulated 3D world. ArXiv, abs/1706.06551, 2017.
- [54]
- [55] G. Cideron, M. Seurin, F. Strub, and O. Pietquin. Self-educated language agent with hindsight experience replay for instruction following. ArXiv, abs/1910.09451, 2019.
- [56]
- [57]
- [58] J. Andreas, D. Klein, and S. Levine. Modular multitask reinforcement learning with policy sketches. ArXiv, abs/1611.01796, 2017.
- [59] L. P. Kaelbling and T. Lozano-Pérez. Hierarchical planning in the now. In Workshops at the Twenty-Fourth AAAI Conference on Artificial Intelligence, 2010.
- [60] S. Srivastava, E. Fang, L. Riano, R. Chitnis, S. Russell, and P. Abbeel. Combined task and motion planning through an extensible planner-independent interface layer. In 2014 IEEE International Conference on Robotics and Automation (ICRA), 2014.
- [61] R. E. Fikes and N. J. Nilsson. STRIPS: A new approach to the application of theorem proving to problem solving. Artificial Intelligence, 1971.
- [62] E. D. Sacerdoti. A structure for plans and behavior. Technical report, SRI International, Menlo Park, California, Artificial Intelligence Center, 1975.
- [63] D. Nau, Y. Cao, A. Lotem, and H. Munoz-Avila. SHOP: Simple hierarchical ordered planner. 1999.
- [64] S. M. LaValle. Planning algorithms. 2006.
- [65]
- [66] M. A. Toussaint, K. R. Allen, K. A. Smith, and J. B. Tenenbaum. Differentiable physics and stable modes for tool-use and manipulation planning. 2018.
- [67] D. Xu, S. Nair, Y. Zhu, J. Gao, A. Garg, L. Fei-Fei, and S. Savarese. Neural task programming: Learning to generalize across hierarchical tasks. In 2018 IEEE International Conference on Robotics and Automation (ICRA), 2018.
- [68] D. Xu, R. Martín-Martín, D.-A. Huang, Y. Zhu, S. Savarese, and L. Fei-Fei. Regression planning networks. Advances in Neural Information Processing Systems, 32, 2019.
- [69]
- [70] B. Eysenbach, R. R. Salakhutdinov, and S. Levine. Search on the replay buffer: Bridging planning and reinforcement learning. Advances in Neural Information Processing Systems, 2019.
- [71] N. Savinov, A. Dosovitskiy, and V. Koltun. Semi-parametric topological memory for navigation. arXiv preprint arXiv:1803.00653, 2018.
- [72]
- [73] C. Matuszek, N. FitzGerald, L. Zettlemoyer, L. Bo, and D. Fox. A joint model of language and perception for grounded attribute learning. arXiv preprint arXiv:1206.6423, 2012.
- [74]
- [75] C. R. Garrett, C. Paxton, T. Lozano-Pérez, L. P. Kaelbling, and D. Fox. Online replanning in belief space for partially observable task and motion problems. In 2020 IEEE International Conference on Robotics and Automation (ICRA), 2020.
- [76] Y. Zhu, D. Gordon, E. Kolve, D. Fox, L. Fei-Fei, A. Gupta, R. Mottaghi, and A. Farhadi. Visual semantic planning using deep successor representations. In Proceedings of the IEEE International Conference on Computer Vision, 2017.
- [77] D. K. Misra, J. Sung, K. Lee, and A. Saxena. Tell me Dave: Context-sensitive grounding of natural language to manipulation instructions. The International Journal of Robotics Research, 2016.
- [78] B. Wu, S. Nair, L. Fei-Fei, and C. Finn. Example-driven model-based reinforcement learning for solving long-horizon visuomotor tasks. In 5th Annual Conference on Robot Learning, 2021.
- [79] S. Nair and C. Finn. Hierarchical foresight: Self-supervised learning of long-horizon tasks via visual subgoal generation. ArXiv, abs/1909.05829, 2020.
- [80] F. Xia, C. Li, R. Martín-Martín, O. Litany, A. Toshev, and S. Savarese. ReLMoGen: Integrating motion generation in reinforcement learning for mobile manipulation. In 2021 IEEE International Conference on Robotics and Automation (ICRA), 2021.