PaLM-E: An Embodied Multimodal Language Model
Recognition: 1 theorem link · Lean Theorem
Pith reviewed 2026-05-10 22:25 UTC · model grok-4.3
The pith
One large model can plan robotic actions, answer visual questions, and caption images across different robot bodies by interleaving sensor data with language.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We propose embodied language models that directly incorporate real-world continuous sensor modalities into language models and thereby establish the link between words and percepts. Input to our embodied language model are multi-modal sentences that interleave visual, continuous state estimation, and textual input encodings. We train these encodings end-to-end, in conjunction with a pre-trained large language model, for multiple embodied tasks including sequential robotic manipulation planning, visual question answering, and captioning. Our evaluations show that PaLM-E, a single large embodied multimodal model, can address a variety of embodied reasoning tasks, from a variety of observation modalities, on multiple embodiments.
What carries the argument
Multi-modal sentences that interleave visual, continuous state estimation, and textual encodings, trained end-to-end with a pre-trained language model.
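The abstract gives no implementation detail for this interleaving, but the idea can be illustrated with a minimal sketch. Everything below is assumed for illustration only: the embedding width, the embed_image/embed_state projections, and the placeholder token ids are hypothetical, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

D_MODEL = 512      # assumed LLM embedding width (illustrative, not PaLM's)
VOCAB = 32_000     # assumed text vocabulary size

# Assumed learned parameters: the LLM's token-embedding table plus two linear
# projections that map sensor features into the same embedding space.
token_embedding = 0.02 * rng.normal(size=(VOCAB, D_MODEL))
W_image = 0.02 * rng.normal(size=(768, D_MODEL))   # ViT patch features -> LLM space
W_state = 0.02 * rng.normal(size=(7, D_MODEL))     # continuous state vector -> LLM space

def embed_text(token_ids):
    """Ordinary word-embedding lookup for text tokens."""
    return token_embedding[np.asarray(token_ids)]

def embed_image(patch_features):
    """Project ViT patch features (n_patches, 768) into the LLM embedding space."""
    return patch_features @ W_image

def embed_state(state_vector):
    """Project one continuous state estimate into a single embedding ('one token')."""
    return (state_vector @ W_state)[None, :]

# A multi-modal sentence: text, image patches, and state interleaved in the
# order they appear in the instruction, then fed to the decoder as one prefix.
prefix = np.concatenate([
    embed_text([101, 7, 42]),                 # e.g. "Given <img>"       (ids are made up)
    embed_image(rng.normal(size=(16, 768))),  # 16 image-patch embeddings
    embed_text([55, 9]),                      # e.g. "and state <state>"
    embed_state(rng.normal(size=(7,))),       # 1 continuous-state embedding
    embed_text([88, 12, 3]),                  # e.g. "what should the robot do?"
], axis=0)

print(prefix.shape)  # (25, 512): one sequence mixing words and percepts
```

The point of the construction is that projected sensor features occupy the same positions and dimensionality as word embeddings, so a pre-trained decoder can attend over the whole interleaved sequence as if it were a single sentence.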
If this is right
- Can perform sequential robotic manipulation planning from varied observation modalities.
- Solves visual question answering and captioning as part of the same model.
- Shows positive transfer when trained jointly on internet-scale language, vision, and embodied data.
- Larger versions retain general language capabilities while reaching state-of-the-art on visual-language benchmarks like OK-VQA.
- Works on multiple different robot embodiments without task-specific redesign.
Where Pith is reading between the lines
- The same model could potentially interpret natural-language instructions while continuously updating its internal state from live sensors.
- Joint training on internet data may allow embodied systems to improve by scaling rather than by hand-designing new modules for each domain.
- The interleaving approach might extend to other continuous signals such as audio or force feedback in future embodiments.
- Physical deployment on unstructured environments would test whether the learned grounding survives real sensor noise and longer task horizons.
Load-bearing premise
End-to-end training of interleaved visual, state, and text encodings with a pre-trained language model will create robust grounding between words and real-world percepts that generalizes across tasks, modalities, and robot embodiments without extra engineering.
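This premise is operational: a single language-modeling loss on the text targets is assumed to be enough to shape the sensor projections. As a hedged sketch, with a linear stand-in for the decoder and made-up token ids rather than PaLM-E's actual architecture, the objective looks like this:

```python
import numpy as np

rng = np.random.default_rng(1)
D_MODEL, VOCAB = 512, 32_000

# Stand-in for the pre-trained decoder's output head; the real model is an
# autoregressive transformer, which this toy readout does not reproduce.
W_readout = 0.02 * rng.normal(size=(D_MODEL, VOCAB))

def next_token_loss(hidden_states, targets):
    """Mean cross-entropy of the target text tokens given per-position hidden states."""
    logits = hidden_states @ W_readout
    logits -= logits.max(axis=-1, keepdims=True)                 # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

# Targets are ordinary text tokens for the plan, e.g. "pick up the red block";
# positions holding image or state embeddings provide context but carry no loss.
plan_targets = np.array([311, 47, 902, 5])                       # assumed token ids
hidden_at_plan_positions = rng.normal(size=(len(plan_targets), D_MODEL))

print(f"toy LM loss: {next_token_loss(hidden_at_plan_positions, plan_targets):.2f}")

# "End-to-end" means this one scalar is backpropagated through the decoder and
# onward into the sensor projections (like W_image and W_state in the sketch
# above), so grounding is whatever those projections must learn to lower the loss.
```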
What would settle it
A new robot embodiment or sensor type where the single trained model performs no better than separately engineered models, or shows no benefit from the joint language-vision-robotics training.
Original abstract
Large language models excel at a wide range of complex tasks. However, enabling general inference in the real world, e.g., for robotics problems, raises the challenge of grounding. We propose embodied language models to directly incorporate real-world continuous sensor modalities into language models and thereby establish the link between words and percepts. Input to our embodied language model are multi-modal sentences that interleave visual, continuous state estimation, and textual input encodings. We train these encodings end-to-end, in conjunction with a pre-trained large language model, for multiple embodied tasks including sequential robotic manipulation planning, visual question answering, and captioning. Our evaluations show that PaLM-E, a single large embodied multimodal model, can address a variety of embodied reasoning tasks, from a variety of observation modalities, on multiple embodiments, and further, exhibits positive transfer: the model benefits from diverse joint training across internet-scale language, vision, and visual-language domains. Our largest model, PaLM-E-562B with 562B parameters, in addition to being trained on robotics tasks, is a visual-language generalist with state-of-the-art performance on OK-VQA, and retains generalist language capabilities with increasing scale.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces PaLM-E, an embodied multimodal language model that directly incorporates real-world continuous sensor modalities (visual, state estimation) into a pre-trained PaLM LLM via interleaved input encodings. These encodings are trained end-to-end alongside the LLM on multiple tasks including sequential robotic manipulation planning, visual question answering, and captioning. The central claims are that a single model can address diverse embodied reasoning tasks across observation modalities and robot embodiments, exhibits positive transfer from joint training on internet-scale language/vision/visual-language data, and that the 562B-parameter variant achieves state-of-the-art on OK-VQA while retaining generalist language capabilities.
Significance. If the empirical results hold under rigorous scrutiny, this would be a significant contribution to embodied AI and multimodal learning. It provides evidence that scaling and joint training across internet-scale and embodied domains can produce generalist models capable of grounded reasoning without task-specific engineering, potentially influencing future work on bridging LLMs with robotics and real-world perception.
major comments (3)
- [Section 4] Section 4 (Experimental Setup and Results): The reported evaluations on robotic manipulation and embodied tasks omit full details on data splits, exact baseline implementations (including whether they use the same PaLM backbone), number of runs, and error bars. This weakens support for the claims of consistent outperformance and positive transfer, as the abstract and results sections present aggregate success metrics without these controls.
- [Section 3.2] Section 3.2 (Input Encoding): The description of how continuous state estimation is tokenized and interleaved with visual and textual encodings is incomplete (no explicit discretization scheme, embedding dimension, or normalization details). This is load-bearing for the grounding claim, as the weakest assumption in the paper is that end-to-end training of these encodings will robustly link words to percepts across embodiments.
- [Table 2] Table 2 / OK-VQA results: The state-of-the-art claim for PaLM-E-562B on OK-VQA lacks a complete set of recent multimodal baselines and an ablation isolating the contribution of the embodied robotics data versus the visual-language pretraining. Without this, it is unclear whether the embodied training is responsible for the reported gains or if they stem primarily from scale.
minor comments (3)
- [Abstract] The abstract and introduction use the term 'positive transfer' without a precise definition or quantitative metric (e.g., improvement over single-task training) in the summary paragraph.
- [Figure 1] Figure 1 (model diagram) would benefit from explicit callouts showing how continuous state values are converted to tokens and interleaved in the input sequence.
- [Section 3] Notation for the multimodal sentence construction (e.g., how visual patches and state vectors are denoted) is introduced without a consolidated table of symbols.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive comments, which have helped us identify areas where the manuscript can be strengthened. We address each major comment below and will incorporate the suggested revisions.
Point-by-point responses
-
Referee: [Section 4] Section 4 (Experimental Setup and Results): The reported evaluations on robotic manipulation and embodied tasks omit full details on data splits, exact baseline implementations (including whether they use the same PaLM backbone), number of runs, and error bars. This weakens support for the claims of consistent outperformance and positive transfer, as the abstract and results sections present aggregate success metrics without these controls.
Authors: We agree that greater transparency in the experimental protocol is needed to support the claims of outperformance and positive transfer. In the revised manuscript, we will expand Section 4 to include: (i) explicit descriptions of all data splits for robotic manipulation and embodied tasks, (ii) confirmation that baselines share the same PaLM backbone and implementation details, (iii) the number of independent runs performed, and (iv) error bars or standard deviations on all reported success rates. These additions will allow readers to better assess the reliability of the results. revision: yes
-
Referee: [Section 3.2] Section 3.2 (Input Encoding): The description of how continuous state estimation is tokenized and interleaved with visual and textual encodings is incomplete (no explicit discretization scheme, embedding dimension, or normalization details). This is load-bearing for the grounding claim, as the weakest assumption in the paper is that end-to-end training of these encodings will robustly link words to percepts across embodiments.
Authors: We acknowledge that the current description of state encoding in Section 3.2 lacks sufficient technical detail. We will revise this section to explicitly specify the discretization scheme applied to continuous state estimates (including binning method and resulting vocabulary size), the embedding dimensionality, and the normalization steps performed prior to interleaving with visual and text tokens. These clarifications will more rigorously support the grounding mechanism across embodiments; an illustrative sketch of one such scheme follows these responses. revision: yes
-
Referee: [Table 2] Table 2 / OK-VQA results: The state-of-the-art claim for PaLM-E-562B on OK-VQA lacks a complete set of recent multimodal baselines and an ablation isolating the contribution of the embodied robotics data versus the visual-language pretraining. Without this, it is unclear whether the embodied training is responsible for the reported gains or if they stem primarily from scale.
Authors: We partially concur. Table 2 already compares against the primary multimodal models available at the time of submission. To strengthen the presentation, we will add an ablation that directly compares PaLM-E variants trained with and without the embodied robotics data, thereby isolating the contribution of joint training beyond scale and visual-language pretraining. We will also incorporate any additional recent baselines that have appeared since submission, while noting that exhaustive coverage of every concurrent work is inherently limited by publication timelines. revision: partial
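Neither the abstract nor the exchange above pins down the state-encoding scheme, so the following sketch is purely illustrative: it assumes per-dimension normalization followed by either uniform binning into extra vocabulary ids or a small MLP projection. The bounds, bin count, base id, and layer sizes are hypothetical, not PaLM-E's documented choices.

```python
import numpy as np

def normalize(state, low, high):
    """Scale each state dimension into [-1, 1] using assumed per-dimension bounds."""
    return 2.0 * (state - low) / (high - low) - 1.0

def bin_to_tokens(normalized, n_bins=256, base_id=32_000):
    """Variant A: discretize each dimension into one of n_bins extra vocabulary ids."""
    bins = np.clip(((normalized + 1.0) / 2.0 * n_bins).astype(int), 0, n_bins - 1)
    return base_id + bins                     # one discrete token per state dimension

def mlp_project(normalized, w1, w2):
    """Variant B: a small MLP that emits one continuous embedding for the whole state."""
    return np.tanh(normalized @ w1) @ w2

rng = np.random.default_rng(2)
low = np.array([-0.5, -0.5, 0.0])             # assumed workspace bounds (metres)
high = np.array([0.5, 0.5, 0.3])
state = np.array([0.12, -0.34, 0.05])         # assumed end-effector position

norm = normalize(state, low, high)
print(bin_to_tokens(norm))                                     # three discrete token ids
print(mlp_project(norm, 0.1 * rng.normal(size=(3, 64)),
                  0.1 * rng.normal(size=(64, 512))).shape)     # (512,) continuous embedding
```

Either variant yields something the decoder can attend to alongside word embeddings; which one the paper actually uses is exactly the detail the referee asks the authors to spell out.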
Circularity Check
No significant circularity detected
Full rationale
The paper presents an empirical architecture and training procedure for PaLM-E, with all central claims (task performance, cross-modal transfer, embodiment generalization) resting on reported experimental results from new end-to-end training and evaluation rather than any closed-form derivation or self-referential definition. The pre-trained PaLM component is invoked as an external starting point whose parameters are not redefined or fitted inside the present work; no equation, prediction, or uniqueness claim reduces by construction to quantities already present in the inputs or prior self-citations. The claims are therefore supported by comparison against external benchmarks rather than by any circular construction.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: End-to-end training on interleaved multimodal inputs will establish effective grounding between language and percepts.
Forward citations
Cited by 60 Pith papers
-
From Prompt to Physical Actuation: Holistic Threat Modeling of LLM-Enabled Robotic Systems
A unified threat model for LLM-enabled robots reveals three cross-boundary attack chains from user input to unsafe physical actuation due to missing validations and unmediated crossings.
-
PRISM: Planning and Reasoning with Intent in Simulated Embodied Environments
PRISM is a tiered benchmark with 300 human-verified tasks across five photorealistic apartments that diagnoses embodied agent failures in basic ability, reasoning ability, and long-horizon ability using an agent-agnostic API.
-
ECHO: Continuous Hierarchical Memory for Vision-Language-Action Models
ECHO organizes VLA experiences into a hierarchical memory tree in hyperbolic space via autoencoder and entailment constraints, delivering a 12.8% success-rate gain on LIBERO-Long over the pi0 baseline.
-
KinDER: A Physical Reasoning Benchmark for Robot Learning and Planning
KinDER is a new open-source benchmark that demonstrates substantial gaps in current robot learning and planning methods for handling physical constraints.
-
AeroBridge-TTA: Test-Time Adaptive Language-Conditioned Control for UAVs
AeroBridge-TTA achieves +22 pt average gains on out-of-distribution UAV dynamics mismatches by updating a latent state online from observed transitions in a language-conditioned policy.
-
Using large language models for embodied planning introduces systematic safety risks
LLM planners for robots often produce dangerous plans even when planning succeeds, with safety awareness staying flat as model scale improves planning ability.
-
Watching Movies Like a Human: Egocentric Emotion Understanding for Embodied Companions
Creates the first egocentric screen-view movie emotion benchmark and demonstrates that cinematic models drop sharply in Macro-F1 on realistic robot-like viewing conditions while domain-specific training improves robustness.
-
Mosaic: Cross-Modal Clustering for Efficient Video Understanding
Mosaic uses cross-modal clusters as the unit for KVCache organization in VLMs to achieve up to 1.38x speedup in streaming long-video understanding.
-
How Far Are Large Multimodal Models from Human-Level Spatial Action? A Benchmark for Goal-Oriented Embodied Navigation in Urban Airspace
Large multimodal models display emerging but limited spatial action capabilities in goal-oriented urban 3D navigation, remaining far from human-level performance with errors diverging rapidly after critical decision points.
-
KITE: Keyframe-Indexed Tokenized Evidence for VLM-Based Robot Failure Analysis
KITE is a training-free method that uses keyframe-indexed tokenized evidence including BEV schematics to enhance VLM performance on robot failure detection, identification, localization, explanation, and correction.
-
LongTail Driving Scenarios with Reasoning Traces: The KITScenes LongTail Dataset
KITScenes LongTail supplies multimodal driving data and multilingual expert reasoning traces to benchmark models on rare scenarios beyond basic safety metrics.
-
3D-VLA: A 3D Vision-Language-Action Generative World Model
3D-VLA is a new embodied foundation model that uses a 3D LLM plus aligned diffusion models to generate future images and point clouds for improved reasoning and action planning in 3D environments.
-
Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation
A new shared video-image tokenizer enables large language models to surpass diffusion models on standard visual generation benchmarks.
-
VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models
VoxPoser uses LLMs to compose 3D value maps via VLM interaction for model-based synthesis of robust robot trajectories on open-set language-specified manipulation tasks.
-
Voyager: An Open-Ended Embodied Agent with Large Language Models
Voyager achieves superior lifelong learning in Minecraft by combining an automatic exploration curriculum, a library of executable skills, and iterative LLM prompting with environment feedback, yielding 3.3x more uniq...
-
LLM+P: Empowering Large Language Models with Optimal Planning Proficiency
LLM+P lets LLMs solve planning problems optimally by converting them to PDDL for classical planners and back to natural language.
-
Visual Instruction Tuning
LLaVA is trained on GPT-4 generated visual instruction data to achieve 85.1% relative performance to GPT-4 on synthetic multimodal tasks and 92.53% accuracy on Science QA.
-
SpatialForge: Bootstrapping 3D-Aware Spatial Reasoning from Open-World 2D Images
SpatialForge synthesizes 10 million spatial QA pairs from in-the-wild 2D images to train VLMs for better depth ordering, layout, and viewpoint-dependent reasoning.
-
How Far Are VLMs from Privacy Awareness in the Physical World? An Empirical Study
Vision-language models exhibit perceptual fragility and fail to consistently respect privacy constraints when operating in simulated physical environments, with performance declining in cluttered scenes and under conf...
-
How Far Are VLMs from Privacy Awareness in the Physical World? An Empirical Study
VLMs show consistent deficits in identifying sensitive items in cluttered scenes, adapting to social contexts, and resolving conflicts between commands and privacy constraints in a new physical simulator benchmark.
-
Affordance Agent Harness: Verification-Gated Skill Orchestration
Affordance Agent Harness is a verification-gated orchestration system that unifies skills via an evidence store, episodic memory priors, an adaptive router, and a self-consistency verifier to improve accuracy-cost tra...
-
AGoQ: Activation and Gradient Quantization for Memory-Efficient Distributed Training of LLMs
AGoQ delivers up to 52% lower memory use and 1.34x faster training for 8B-32B LLaMA models by using near-4-bit adaptive activations and 8-bit gradients while preserving pretraining convergence and downstream accuracy.
-
Physically Native World Models: A Hamiltonian Perspective on Generative World Modeling
Hamiltonian World Models structure latent dynamics around energy-conserving Hamiltonian evolution to produce physically grounded, action-controllable predictions for embodied decision making.
-
Odysseus: Scaling VLMs to 100+ Turn Decision-Making in Games via Reinforcement Learning
Odysseus adapts PPO with a turn-level critic and leverages pretrained VLM action priors to train agents achieving at least 3x average game progress over frontier models in long-horizon Super Mario Land.
-
GoClick: Lightweight Element Grounding Model for Autonomous GUI Interaction
GoClick is a compact 230M-parameter encoder-decoder VLM for GUI element grounding that matches larger models' accuracy via a Progressive Data Refinement pipeline yielding a 3.8M-sample core set.
-
An LLM-Driven Closed-Loop Autonomous Learning Framework for Robots Facing Uncovered Tasks in Open Environments
Robots autonomously convert LLM-guided experiences into a reusable local method library, reducing average execution time from 7.7772s to 6.7779s and LLM calls per task from 1.0 to 0.2 in repeated-task experiments.
-
Navigating the Clutter: Waypoint-Based Bi-Level Planning for Multi-Robot Systems
Waypoint-based bi-level planning with curriculum RLVR improves multi-robot task success rates in dense-obstacle benchmarks over motion-agnostic and VLA baselines.
-
EmbodiedGovBench: A Benchmark for Governance, Recovery, and Upgrade Safety in Embodied Agent Systems
EmbodiedGovBench is a new benchmark framework that measures embodied agent systems on seven governance dimensions including policy adherence, recovery success, and upgrade safety.
-
Q-Zoom: Query-Aware Adaptive Perception for Efficient Multimodal Large Language Models
Q-Zoom achieves up to 4.39x inference speedup in high-resolution MLLM scenarios via query-aware gating and region localization, matching or exceeding baseline accuracy on document and high-res benchmarks.
-
A Physical Agentic Loop for Language-Guided Grasping with Execution-State Monitoring
A physical agentic loop with execution-state monitoring improves robustness of language-guided grasping over open-loop execution by converting noisy telemetry into discrete outcome events that trigger retries or user ...
-
CoME-VL: Scaling Complementary Multi-Encoder Vision-Language Learning
CoME-VL fuses contrastive and self-supervised vision encoders via entropy-guided multi-layer aggregation and RoPE cross-attention to improve vision-language model performance on benchmarks.
-
World Action Models are Zero-shot Policies
DreamZero uses a 14B video diffusion model as a World Action Model to achieve over 2x better zero-shot generalization on real robots than state-of-the-art VLAs, real-time 7Hz closed-loop control, and cross-embodiment ...
-
GR00T N1: An Open Foundation Model for Generalist Humanoid Robots
GR00T N1 is a new open VLA foundation model for humanoid robots that outperforms imitation learning baselines in simulation and shows strong performance on real-world bimanual manipulation tasks.
-
FAST: Efficient Action Tokenization for Vision-Language-Action Models
FAST applies discrete cosine transform to robot action sequences for efficient tokenization, enabling autoregressive VLAs to succeed on high-frequency dexterous tasks and scale to 10k hours of data while matching diff...
-
$\pi_0$: A Vision-Language-Action Flow Model for General Robot Control
π₀ is a vision-language-action flow model trained on diverse multi-platform robot data that supports zero-shot task performance, language instruction following, and efficient fine-tuning for dexterous tasks.
-
EMMA: End-to-End Multimodal Model for Autonomous Driving
EMMA is an end-to-end multimodal LLM that converts camera data into trajectories, objects, and road graphs via text prompts and reports state-of-the-art motion planning on nuScenes plus competitive detection results on Waymo.
-
OS-ATLAS: A Foundation Action Model for Generalist GUI Agents
OS-Atlas, trained on the largest open-source cross-platform GUI grounding corpus of 13 million elements, outperforms prior open-source models on six benchmarks across mobile, desktop, and web platforms.
-
RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation
RDT-1B is a diffusion foundation model that unifies action spaces across robots and demonstrates superior bimanual manipulation with zero-shot generalization, language following, and few-shot learning on real robots.
-
OpenVLA: An Open-Source Vision-Language-Action Model
OpenVLA achieves 16.5% higher task success than the 55B RT-2-X model across 29 tasks with 7x fewer parameters while enabling effective fine-tuning and quantization without performance loss.
-
Octo: An Open-Source Generalist Robot Policy
Octo is an open-source transformer-based generalist robot policy pretrained on 800k trajectories that serves as an effective initialization for finetuning across diverse robotic platforms.
-
Unleashing Large-Scale Video Generative Pre-training for Visual Robot Manipulation
A GPT-style model pre-trained on large video datasets achieves 94.9% success on CALVIN multi-task manipulation and 85.4% zero-shot generalization, outperforming prior baselines.
-
TD-MPC2: Scalable, Robust World Models for Continuous Control
TD-MPC2 scales an implicit world-model RL method to a 317M-parameter agent that masters 80 tasks across four domains with a single hyperparameter configuration.
-
Kosmos-2: Grounding Multimodal Large Language Models to the World
Kosmos-2 grounds text to image regions by encoding referring expressions as Markdown links to sequences of location tokens and trains on a new GrIT dataset of grounded image-text pairs.
-
MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models
MME is a manually annotated benchmark evaluating MLLMs on perception and cognition across 14 subtasks to avoid data leakage and support fair model comparisons.
-
MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models
MiniGPT-4 shows that aligning a frozen vision encoder to Vicuna via one projection layer plus a second-stage detailed-description fine-tune produces GPT-4-like vision-language abilities including detailed captions, cr...
-
MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action
MM-REACT uses textual prompts to let ChatGPT collaborate with external vision experts for zero-shot multimodal reasoning and action on advanced visual tasks.
-
Nautilus: From One Prompt to Plug-and-Play Robot Learning
NAUTILUS is a prompt-driven harness that automates plug-and-play adapters, typed contracts, and validation for policies, benchmarks, and robots in learning research.
-
ProcVLM: Learning Procedure-Grounded Progress Rewards for Robotic Manipulation
ProcVLM learns procedure-grounded dense progress rewards for robotic manipulation via a reasoning-before-estimation VLM trained on a 60M-frame synthesized corpus from 30 embodied datasets.
-
BioProVLA-Agent: An Affordable, Protocol-Driven, Vision-Enhanced VLA-Enabled Embodied Multi-Agent System with Closed-Loop-Capable Reasoning for Biological Laboratory Manipulation
BioProVLA-Agent integrates protocol parsing, visual state verification, and VLA-based execution in a closed-loop multi-agent framework with AugSmolVLA augmentation to improve robustness for biological lab tasks like t...
-
Cross-Modal Navigation with Multi-Agent Reinforcement Learning
CRONA is a MARL framework that uses modality-specialized agents with auxiliary beliefs and a centralized multi-modal critic to achieve better performance and efficiency than single-agent baselines on visual-acoustic n...
-
Visibility-Aware Mobile Grasping in Dynamic Environments
A visibility-aware mobile grasping system with iterative whole-body planning and behavior-tree subgoal generation achieves 68.8% success in unknown static and 58% in dynamic environments, outperforming a baseline by 2...
-
AGoQ: Activation and Gradient Quantization for Memory-Efficient Distributed Training of LLMs
AGoQ cuts LLM training memory by up to 52% and speeds it up by 1.34x using tailored 4-bit activations and 8-bit gradients with special communication, matching baseline accuracy on LLaMA models.
-
Intention-Aware Semantic Agent Communications for AI Glasses
An intention-aware semantic agent system for AI glasses reduces bandwidth by over 50% in simulations while preserving task performance through adaptive preprocessing guided by inferred user intentions.
-
Cortex 2.0: Grounding World Models in Real-World Industrial Deployment
Cortex 2.0 introduces world-model-based planning that generates and scores future trajectories to outperform reactive vision-language-action baselines on industrial robotic tasks including pick-and-place, sorting, and...
-
Gated Coordination for Efficient Multi-Agent Collaboration in Minecraft Game
Gated escalation and partitioned states enable more efficient multi-agent collaboration in Minecraft by making communication selective rather than automatic.
-
ReFineVLA: Multimodal Reasoning-Aware Generalist Robotic Policies via Teacher-Guided Fine-Tuning
ReFineVLA adds teacher-generated reasoning steps to VLA training and reports state-of-the-art success rates on SimplerEnv WidowX and Google Robot benchmarks.
-
SpaceMind: A Modular and Self-Evolving Embodied Vision-Language Agent Framework for Autonomous On-orbit Servicing
SpaceMind is a self-evolving modular VLM agent framework that achieves 90-100% navigation success in nominal conditions and recovers from failures via experience distillation, with zero-code transfer to physical robot...
-
Jump-Start Reinforcement Learning with Vision-Language-Action Regularization
VLAJS augments PPO with sparse annealed VLA guidance through directional regularization to cut required interactions by over 50% on manipulation tasks and enable zero-shot sim-to-real transfer.
-
CoEnv: Driving Embodied Multi-Agent Collaboration via Compositional Environment
CoEnv introduces a compositional environment that integrates real and simulated spaces for multi-agent robotic collaboration, using real-to-sim reconstruction, VLM action synthesis, and validated sim-to-real transfer ...
-
ROSClaw: A Hierarchical Semantic-Physical Framework for Heterogeneous Multi-Agent Collaboration
ROSClaw is a hierarchical framework that unifies vision-language model control with e-URDF-based sim-to-real mapping and closed-loop data collection to enable semantic-physical collaboration among heterogeneous multi-...
Reference graph
Works this paper leans on
-
[1]
Do As I Can, Not As I Say: Grounding Language in Robotic Affordances
Ahn, M., Brohan, A., Brown, N., Chebotar, Y ., Cortes, O., David, B., Finn, C., Gopalakrishnan, K., Hausman, K., Herzog, A., et al. Do as i can, not as i say: Ground- ing language in robotic affordances. arXiv preprint arXiv:2204.01691,
-
[2]
Flamingo: a Visual Language Model for Few-Shot Learning
Alayrac, J.-B., Donahue, J., Luc, P., Miech, A., Barr, I., Hasson, Y ., Lenc, K., Mensch, A., Millican, K., Reynolds, M., et al. Flamingo: a visual language model for few-shot learning. arXiv preprint arXiv:2204.14198,
-
[3]
On the Opportunities and Risks of Foundation Models
Bommasani, R., Hudson, D. A., Adeli, E., Altman, R., Arora, S., von Arx, S., Bernstein, M. S., Bohg, J., Bosse- lut, A., Brunskill, E., et al. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258,
-
[4]
RT-1: Robotics Transformer for Real-World Control at Scale
Brohan, A., Brown, N., Carbajal, J., Chebotar, Y ., Dabis, J., Finn, C., Gopalakrishnan, K., Hausman, K., Herzog, A., Hsu, J., et al. Rt-1: Robotics transformer for real-world control at scale. arXiv preprint arXiv:2212.06817,
-
[5]
Language Models are Few-Shot Learners
Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33: 1877–1901,
-
[6]
Evaluating Large Language Models Trained on Code
Chen, M., Tworek, J., Jun, H., Yuan, Q., Pinto, H. P. d. O., Kaplan, J., Edwards, H., Burda, Y., Joseph, N., Brockman, G., et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021a. Chen, T., Saxena, S., Li, L., Fleet, D. J., and Hinton, G. Pix2seq: A language modeling framework...
-
[7]
PaLI: A Jointly-Scaled Multilingual Language-Image Model
Chen, X., Wang, X., Changpinyo, S., Piergiovanni, A., Padlewski, P., Salz, D., Goodman, S., Grycner, A., Mustafa, B., Beyer, L., et al. Pali: A jointly-scaled multilingual language-image model. arXiv preprint arXiv:2209.06794,
-
[8]
PaLM: Scaling Language Modeling with Pathways
Chowdhery, A., Narang, S., Devlin, J., Bosma, M., Mishra, G., Roberts, A., Barham, P., Chung, H. W., Sutton, C., Gehrmann, S., et al. Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311,
-
[9]
Scaling vision transformers to 22 billion parameters
Dehghani, M., Djolonga, J., Mustafa, B., Padlewski, P., Heek, J., Gilmer, J., Steiner, A., Caron, M., Geirhos, R., Alabdulmohsin, I., et al. Scaling vision transformers to 22 billion parameters. arXiv preprint arXiv:2302.05442,
-
[10]
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. Bert: Pre-training of deep bidirectional transformers for lan- guage understanding. arXiv preprint arXiv:1810.04805,
-
[11]
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al. An image is worth 16x16 words: Transformers for image recognition at scale.arXiv preprint arXiv:2010.11929,
-
[12]
Improving alignment of dialogue agents via targeted human judgements
Glaese, A., McAleese, N., Trebacz, M., Aslanides, J., Firoiu, V ., Ewalds, T., Rauh, M., Weidinger, L., Chadwick, M., Thacker, P., et al. Improving alignment of dialogue agents via targeted human judgements. arXiv preprint arXiv:2209.14375,
-
[13]
Instruction-driven history-aware policies for robotic manipulations
Guhur, P.-L., Chen, S., Garcia, R., Tapaswi, M., Laptev, I., and Schmid, C. Instruction-driven history-aware policies for robotic manipulations. arXiv preprint arXiv:2209.04899,
-
[14]
Language models are general-purpose interfaces
Hao, Y ., Song, H., Dong, L., Huang, S., Chi, Z., Wang, W., Ma, S., and Wei, F. Language models are general-purpose interfaces. arXiv preprint arXiv:2206.06336,
-
[15]
Visual Language Maps for Robot Navigation
Huang, C., Mees, O., Zeng, A., and Burgard, W. Visual language maps for robot navigation. arXiv preprint arXiv:2210.05714, 2022a. Huang, W., Abbeel, P., Pathak, D., and Mordatch, I. Language models as zero-shot planners: Extracting actionable knowledge for embodied agents. arXiv preprint arXiv:2201.07207, 2022b. Huang, W., Xia, F., Xiao, T., Chan, H...
-
[16]
VIMA: General Robot Manipulation with Multimodal Prompts
Jiang, Y ., Gupta, A., Zhang, Z., Wang, G., Dou, Y ., Chen, Y ., Fei-Fei, L., Anandkumar, A., Zhu, Y ., and Fan, L. Vima: General robot manipulation with multimodal prompts. arXiv preprint arXiv:2210.03094,
-
[17]
Large Language Models are Zero-Shot Reasoners
Kojima, T., Gu, S. S., Reid, M., Matsuo, Y ., and Iwasawa, Y . Large language models are zero-shot reasoners.arXiv preprint arXiv:2205.11916,
-
[18]
The Power of Scale for Parameter-Efficient Prompt Tuning
Lester, B., Al-Rfou, R., and Constant, N. The power of scale for parameter-efficient prompt tuning. arXiv preprint arXiv:2104.08691,
-
[19]
Solving Quantitative Reasoning Problems with Language Models
Lewkowycz, A., Andreassen, A., Dohan, D., Dyer, E., Michalewski, H., Ramasesh, V ., Slone, A., Anil, C., Schlag, I., Gutman-Solo, T., et al. Solving quantitative reasoning problems with language models. arXiv preprint arXiv:2206.14858,
-
[20]
VisualBERT: A Simple and Performant Baseline for Vision and Language
Li, L. H., Yatskar, M., Yin, D., Hsieh, C.-J., and Chang, K.-W. Visualbert: A simple and performant baseline for vision and language. arXiv preprint arXiv:1908.03557,
-
[21]
TrOCR: Transformer-Based Optical Character Recognition with Pre-trained Models
Li, M., Lv, T., Chen, J., Cui, L., Lu, Y ., Florencio, D., Zhang, C., Li, Z., and Wei, F. Trocr: Transformer-based optical character recognition with pre-trained models. arXiv preprint arXiv:2109.10282,
-
[22]
Pre-trained Language Models for Interactive Decision-Making
Li, S., Puig, X., Du, Y ., Wang, C., Akyurek, E., Torralba, A., Andreas, J., and Mordatch, I. Pre-trained language models for interactive decision-making. arXiv preprint arXiv:2202.01771,
-
[23]
Code as Policies: Language Model Programs for Embodied Control
Liang, J., Huang, W., Xia, F., Xu, P., Hausman, K., Ichter, B., Florence, P., and Zeng, A. Code as policies: Language model programs for embodied control. arXiv preprint arXiv:2209.07753,
-
[24]
Pretrained transformers as universal computation engines
Lu, K., Grover, A., Abbeel, P., and Mordatch, I. Pretrained transformers as universal computation engines. arXiv preprint arXiv:2103.05247, 1,
-
[25]
Language Conditioned Imitation Learning over Unstructured Data
Lynch, C. and Sermanet, P. Language conditioned imi- tation learning over unstructured data. arXiv preprint arXiv:2005.07648,
-
[26]
Interactive Language: Talking to Robots in Real Time
Lynch, C., Wahid, A., Tompson, J., Ding, T., Betker, J., Baruch, R., Armstrong, T., and Florence, P. Interactive language: Talking to robots in real time. arXiv preprint arXiv:2210.06407,
-
[27]
Do Embodied Agents Dream of Pixelated Sheep? Embodied Decision Making Using Language Guided World Modelling
Nottingham, K., Ammanabrolu, P., Suhr, A., Choi, Y., Hajishirzi, H., Singh, S., and Fox, R. Do embodied agents dream of pixelated sheep?: Embodied decision making using language guided world modelling. arXiv preprint arXiv:2301.12050,
- [28]
-
[29]
A Generalist Agent
Reed, S., Zolna, K., Parisotto, E., Colmenarejo, S. G., Novikov, A., Barth-Maron, G., Gimenez, M., Sulsky, Y., Kay, J., Springenberg, J. T., et al. A generalist agent. arXiv preprint arXiv:2205.06175,
-
[30]
TokenLearner: What Can 8 Learned Tokens Do for Images and Videos?
Ryoo, M. S., Piergiovanni, A., Arnab, A., Dehghani, M., and Angelova, A. Tokenlearner: What can 8 learned tokens do for images and videos? arXiv preprint arXiv:2106.11297,
-
[31]
Object Scene Representation Transformer
Sajjadi, M. S. M., Duckworth, D., Mahendran, A., van Steenkiste, S., Pavetić, F., Lučić, M., Guibas, L. J., Greff, K., and Kipf, T. Object Scene Representation Transformer. NeurIPS, 2022a. URL https://osrt-paper.github.io/. Sajjadi, M. S. M., Meyer, H., Pot, E., Bergmann, U., Greff, K., Radwan, N., Vora, S., Lučić, M., Duckworth, D., Dosovitsk...
- [32]
-
[33]
Perceiver-Actor: A Multi-Task Transformer for Robotic Manipulation
Shridhar, M., Manuelli, L., and Fox, D. Cliport: What and where pathways for robotic manipulation. In Conference on Robot Learning, pp. 894–906. PMLR, 2022a. Shridhar, M., Manuelli, L., and Fox, D. Perceiver-actor: A multi-task transformer for robotic manipulation. arXiv preprint arXiv:2209.05451, 2022b. Silva, A., Moorman, N., Silva, W., Zaidi, Z., Gopal...
-
[34]
Progprompt: Generating situated robot task plans using large language models
Singh, I., Blukis, V ., Mousavian, A., Goyal, A., Xu, D., Tremblay, J., Fox, D., Thomason, J., and Garg, A. Prog- Prompt: Generating situated robot task plans using large language models. arXiv preprint arXiv:2209.11302 ,
-
[35]
LaMDA: Language Models for Dialog Applications
Thoppilan, R., De Freitas, D., Hall, J., Shazeer, N., Kulshreshtha, A., Cheng, H.-T., Jin, A., Bos, T., Baker, L., Du, Y., et al. Lamda: Language models for dialog applications. arXiv preprint arXiv:2201.08239,
-
[36]
Describe, Explain, Plan and Select: Interactive Planning with Large Language Models Enables Open-World Multi-Task Agents
Wang, Z., Cai, S., Liu, A., Ma, X., and Liang, Y. Describe, explain, plan and select: Interactive planning with large language models enables open-world multi-task agents. arXiv preprint arXiv:2302.01560,
-
[37]
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., and Zhou, D. Chain of thought prompting elic- its reasoning in large language models. arXiv preprint arXiv:2201.11903,
-
[38]
Robotic Skill Acquisition via Instruction Augmentation with Vision-Language Models
Xiao, T., Chan, H., Sermanet, P., Wahid, A., Brohan, A., Hausman, K., Levine, S., and Tompson, J. Robotic skill acquisition via instruction augmentation with vision- language models. arXiv preprint arXiv:2211.11736 ,
-
[39]
PIGLeT: Language Grounding Through Neuro-Symbolic Interaction in a 3D World
Zellers, R., Holtzman, A., Peters, M., Mottaghi, R., Kembhavi, A., Farhadi, A., and Choi, Y. Piglet: Language grounding through neuro-symbolic interaction in a 3d world. arXiv preprint arXiv:2106.00188, 2021a. Zellers, R., Lu, X., Hessel, J., Yu, Y., Park, J. S., Cao, J., Farhadi, A., and Choi, Y. Merlot: Multimodal neural script knowledge models. Ad...
-
[40]
Hierarchical Task Learning from Language Instructions with Unified Transformers and Self-Monitoring
Zhang, Y. and Chai, J. Hierarchical task learning from language instructions with unified transformers and self-monitoring. arXiv preprint arXiv:2106.03427,
discussion (0)