PaLM-E: An Embodied Multimodal Language Model

Aakanksha Chowdhery; Andy Zeng; Ayzaan Wahid; Brian Ichter; Corey Lynch; Daniel Duckworth; Danny Driess; Fei Xia; Igor Mordatch; Jonathan Tompson

arxiv: 2303.03378 · v1 · submitted 2023-03-06 · 💻 cs.LG · cs.AI · cs.RO

Danny Driess, Fei Xia, Mehdi S. M. Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, Wenlong Huang, Yevgen Chebotar, Pierre Sermanet, Daniel Duckworth, Sergey Levine, Vincent Vanhoucke, Karol Hausman, Marc Toussaint, Klaus Greff, Andy Zeng, Igor Mordatch, Pete Florence

Pith reviewed 2026-05-10 22:25 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.RO

keywords embodied language model · multimodal learning · robotics · visual question answering · language grounding · PaLM-E · transfer learning · embodied AI

The pith

One large model can plan robotic actions, answer visual questions, and caption images across different robot bodies by interleaving sensor data with language.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to show that large language models can be made to reason about the physical world by feeding them inputs that mix visual observations, robot state readings, and text, then training the whole system end-to-end. A sympathetic reader would care because this promises a single model that grounds words to real percepts without building separate systems for each task or each robot. The authors demonstrate the approach on manipulation planning, visual question answering, and captioning, using data from multiple sensor types and multiple robot platforms. They further report that training jointly on internet-scale language, vision, and robotics data produces positive transfer, and that the biggest version remains a capable general language and visual-language model.

Core claim

We propose embodied language models that directly incorporate real-world continuous sensor modalities into language models and thereby establish the link between words and percepts. Input to our embodied language model are multi-modal sentences that interleave visual, continuous state estimation, and textual input encodings. We train these encodings end-to-end, in conjunction with a pre-trained large language model, for multiple embodied tasks including sequential robotic manipulation planning, visual question answering, and captioning. Our evaluations show that PaLM-E, a single large embodied multimodal model, can address a variety of embodied reasoning tasks, from a variety of observation modalities, on multiple embodiments, and further, exhibits positive transfer: the model benefits from diverse joint training across internet-scale language, vision, and visual-language domains.

What carries the argument

Multi-modal sentences that interleave visual, continuous state estimation, and textual encodings, trained end-to-end with a pre-trained language model.
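The interleaving idea can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes each continuous observation is mapped into the language model's embedding space by an affine projection (random here, standing in for a trained encoder), and that text tokens come from a lookup table. The names `Text`, `Continuous`, `make_projection`, `build_sentence`, and the width `EMBED_DIM` are hypothetical, and each observation yields a single vector for simplicity.

```python
import random
from dataclasses import dataclass
from typing import Dict, List, Tuple, Union

EMBED_DIM = 8  # hypothetical embedding width, kept small for readability

Vector = List[float]
Projection = Tuple[List[Vector], Vector]  # (weight rows, bias)


@dataclass
class Text:
    tokens: List[str]


@dataclass
class Continuous:
    modality: str  # illustrative label, e.g. "image" or "state"
    values: Vector


def make_projection(in_dim: int, rng: random.Random) -> Projection:
    """Random affine map standing in for a trained encoder (assumption)."""
    weights = [[rng.gauss(0.0, 0.1) for _ in range(in_dim)] for _ in range(EMBED_DIM)]
    return weights, [0.0] * EMBED_DIM


def project(values: Vector, projection: Projection) -> Vector:
    """Map a continuous observation into the shared embedding space."""
    weights, bias = projection
    return [sum(w * v for w, v in zip(row, values)) + b for row, b in zip(weights, bias)]


def build_sentence(
    segments: List[Union[Text, Continuous]],
    projections: Dict[str, Projection],
    token_table: Dict[str, Vector],
    rng: random.Random,
) -> List[Vector]:
    """Interleave text embeddings and projected observations in input order."""
    sequence: List[Vector] = []
    for segment in segments:
        if isinstance(segment, Text):
            for token in segment.tokens:
                # Lazily create an embedding the first time a token is seen.
                if token not in token_table:
                    token_table[token] = [rng.gauss(0.0, 0.1) for _ in range(EMBED_DIM)]
                sequence.append(token_table[token])
        else:
            sequence.append(project(segment.values, projections[segment.modality]))
    return sequence


rng = random.Random(0)
projections = {"image": make_projection(4, rng), "state": make_projection(3, rng)}
segments = [
    Text(["what", "is", "in"]),
    Continuous("image", [0.1, 0.2, 0.3, 0.4]),
    Text(["?", "robot", "state", ":"]),
    Continuous("state", [0.0, 0.5, -0.2]),
]
sentence = build_sentence(segments, projections, {}, rng)
print(len(sentence), len(sentence[0]))  # prints "9 8"
```

The resulting list is what a language model would consume in place of a purely textual embedding sequence. In end-to-end training the projections would be updated together with, or alongside, the language model rather than left random.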

If this is right

Can perform sequential robotic manipulation planning from varied observation modalities.
Solves visual question answering and captioning as part of the same model.
Shows positive transfer when trained jointly on internet-scale language, vision, and visual-language data alongside embodied data.
The largest version, PaLM-E-562B, retains general language capabilities while reaching state-of-the-art on the OK-VQA visual-language benchmark.
Works on multiple different robot embodiments without task-specific redesign.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same model could potentially interpret natural-language instructions while continuously updating its internal state from live sensors.
Joint training on internet data may allow embodied systems to improve by scaling rather than by hand-designing new modules for each domain.
The interleaving approach might extend to other continuous signals such as audio or force feedback in future embodiments.
Physical deployment on unstructured environments would test whether the learned grounding survives real sensor noise and longer task horizons.

Load-bearing premise

End-to-end training of interleaved visual, state, and text encodings with a pre-trained language model will create robust grounding between words and real-world percepts that generalizes across tasks, modalities, and robot embodiments without extra engineering.

What would settle it

A new robot embodiment or sensor type where the single trained model performs no better than separately engineered models, or shows no benefit from the joint language-vision-robotics training.

Original abstract

Large language models excel at a wide range of complex tasks. However, enabling general inference in the real world, e.g., for robotics problems, raises the challenge of grounding. We propose embodied language models to directly incorporate real-world continuous sensor modalities into language models and thereby establish the link between words and percepts. Input to our embodied language model are multi-modal sentences that interleave visual, continuous state estimation, and textual input encodings. We train these encodings end-to-end, in conjunction with a pre-trained large language model, for multiple embodied tasks including sequential robotic manipulation planning, visual question answering, and captioning. Our evaluations show that PaLM-E, a single large embodied multimodal model, can address a variety of embodied reasoning tasks, from a variety of observation modalities, on multiple embodiments, and further, exhibits positive transfer: the model benefits from diverse joint training across internet-scale language, vision, and visual-language domains. Our largest model, PaLM-E-562B with 562B parameters, in addition to being trained on robotics tasks, is a visual-language generalist with state-of-the-art performance on OK-VQA, and retains generalist language capabilities with increasing scale.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

PaLM-E shows a pre-trained LM can take in robot states and vision as tokens and handle multiple embodied tasks with some cross-domain transfer, but the abstract leaves the experimental details thin.

PaLM-E takes a pre-trained PaLM and adds encoders for images and continuous robot states so that everything is turned into tokens the language model can process. The authors train the whole system end-to-end on a mix of robotics data, visual question answering, and captioning, and the biggest version (562B) reaches state-of-the-art on OK-VQA while still handling language tasks. The central observation is that joint training across these domains produces positive transfer rather than interference, and that the same model works on different robot bodies and input types. That unified input format and the scaling result are the concrete pieces of new work here.

The paper does a reasonable job of showing that language-model capacity can be reused for physical tasks without starting from zero, and the multi-embodiment evaluation is broader than most robotics papers manage. The claim that diverse internet-scale data helps the embodied side is the part worth paying attention to if it holds up.

The soft spots are mostly in the reporting. The abstract states high-level successes but omits the exact baselines, ablations on the continuous-state channel, error bars, and data-split details, so it is hard to judge how much the embodied training actually moves the needle compared with vision-language pretraining alone. If those sections in the full paper are thin, the transfer story stays suggestive rather than definitive. The tasks also look somewhat controlled, so real-world robustness is still an open question.

This paper is for people working on multimodal models who want a data point on whether scaling laws extend to robotics without heavy task-specific engineering. It is not a foundational rethink, but the empirical scope is large enough that a serious editor should send it to referees for a closer check on the experiments and comparisons.

Referee Report

3 major / 3 minor

Summary. The manuscript introduces PaLM-E, an embodied multimodal language model that directly incorporates real-world continuous sensor modalities (visual, state estimation) into a pre-trained PaLM LLM via interleaved input encodings. These encodings are trained end-to-end alongside the LLM on multiple tasks including sequential robotic manipulation planning, visual question answering, and captioning. The central claims are that a single model can address diverse embodied reasoning tasks across observation modalities and robot embodiments, exhibits positive transfer from joint training on internet-scale language/vision/visual-language data, and that the 562B-parameter variant achieves state-of-the-art on OK-VQA while retaining generalist language capabilities.

Significance. If the empirical results hold under rigorous scrutiny, this would be a significant contribution to embodied AI and multimodal learning. It provides evidence that scaling and joint training across internet-scale and embodied domains can produce generalist models capable of grounded reasoning without task-specific engineering, potentially influencing future work on bridging LLMs with robotics and real-world perception.

major comments (3)

[Section 4] Section 4 (Experimental Setup and Results): The reported evaluations on robotic manipulation and embodied tasks omit full details on data splits, exact baseline implementations (including whether they use the same PaLM backbone), number of runs, and error bars. This weakens support for the claims of consistent outperformance and positive transfer, as the abstract and results sections present aggregate success metrics without these controls.
[Section 3.2] Section 3.2 (Input Encoding): The description of how continuous state estimation is tokenized and interleaved with visual and textual encodings is incomplete (no explicit discretization scheme, embedding dimension, or normalization details). This is load-bearing for the grounding claim, as the weakest assumption in the paper is that end-to-end training of these encodings will robustly link words to percepts across embodiments.
[Table 2] Table 2 / OK-VQA results: The state-of-the-art claim for PaLM-E-562B on OK-VQA lacks a complete set of recent multimodal baselines and an ablation isolating the contribution of the embodied robotics data versus the visual-language pretraining. Without this, it is unclear whether the embodied training is responsible for the reported gains or if they stem primarily from scale.

minor comments (3)

[Abstract] The abstract and introduction use the term 'positive transfer' without a precise definition or quantitative metric (e.g., improvement over single-task training) in the summary paragraph.
[Figure 1] Figure 1 (model diagram) would benefit from explicit callouts showing how continuous state values are converted to tokens and interleaved in the input sequence.
[Section 3] Notation for the multimodal sentence construction (e.g., how visual patches and state vectors are denoted) is introduced without a consolidated table of symbols.
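One way to make "positive transfer" quantitative, as the first minor comment asks, is the per-task difference between a jointly trained model and a single-task model. The sketch below assumes success rates in [0, 1]. The function name `transfer_gains`, the task names, and the numbers are placeholders for illustration, not results from the paper.

```python
from typing import Dict


def transfer_gains(single_task: Dict[str, float], joint: Dict[str, float]) -> Dict[str, float]:
    """Per-task gain of joint training over single-task training.

    A positive value indicates positive transfer on that task, a negative
    value indicates interference.
    """
    mismatched = set(single_task) ^ set(joint)
    if mismatched:
        raise ValueError(f"tasks missing from one side: {sorted(mismatched)}")
    return {task: joint[task] - single_task[task] for task in single_task}


# Placeholder success rates, chosen only to exercise the function.
single = {"task_a": 0.5, "task_b": 0.75}
joint = {"task_a": 0.75, "task_b": 0.5}
gains = transfer_gains(single, joint)
print(gains)  # prints "{'task_a': 0.25, 'task_b': -0.25}"
```

Reporting the gain per task, rather than a single average, keeps cases of interference visible next to cases of transfer.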

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments, which have helped us identify areas where the manuscript can be strengthened. We address each major comment below and will incorporate the suggested revisions.

Referee: [Section 4] Section 4 (Experimental Setup and Results): The reported evaluations on robotic manipulation and embodied tasks omit full details on data splits, exact baseline implementations (including whether they use the same PaLM backbone), number of runs, and error bars. This weakens support for the claims of consistent outperformance and positive transfer, as the abstract and results sections present aggregate success metrics without these controls.

Authors: We agree that greater transparency in the experimental protocol is needed to support the claims of outperformance and positive transfer. In the revised manuscript, we will expand Section 4 to include: (i) explicit descriptions of all data splits for robotic manipulation and embodied tasks, (ii) confirmation that baselines share the same PaLM backbone and implementation details, (iii) the number of independent runs performed, and (iv) error bars or standard deviations on all reported success rates. These additions will allow readers to better assess the reliability of the results. revision: yes

Referee: [Section 3.2] Section 3.2 (Input Encoding): The description of how continuous state estimation is tokenized and interleaved with visual and textual encodings is incomplete (no explicit discretization scheme, embedding dimension, or normalization details). This is load-bearing for the grounding claim, as the weakest assumption in the paper is that end-to-end training of these encodings will robustly link words to percepts across embodiments.

Authors: We acknowledge that the current description of state encoding in Section 3.2 lacks sufficient technical detail. We will revise this section to explicitly specify the discretization scheme applied to continuous state estimates (including binning method and resulting vocabulary size), the embedding dimensionality, and the normalization steps performed prior to interleaving with visual and text tokens. These clarifications will more rigorously support the grounding mechanism across embodiments. revision: yes

Referee: [Table 2] Table 2 / OK-VQA results: The state-of-the-art claim for PaLM-E-562B on OK-VQA lacks a complete set of recent multimodal baselines and an ablation isolating the contribution of the embodied robotics data versus the visual-language pretraining. Without this, it is unclear whether the embodied training is responsible for the reported gains or if they stem primarily from scale.

Authors: We partially concur. Table 2 already compares against the primary multimodal models available at the time of submission. To strengthen the presentation, we will add an ablation that directly compares PaLM-E variants trained with and without the embodied robotics data, thereby isolating the contribution of joint training beyond scale and visual-language pretraining. We will also incorporate any additional recent baselines that have appeared since submission, while noting that exhaustive coverage of every concurrent work is inherently limited by publication timelines. revision: partial
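The error bars promised in the first response could be computed along the following lines. This is a minimal sketch assuming one success rate per independent run. The function name `summarize_runs` and the run values are placeholders, not figures from the paper.

```python
import math
import statistics
from typing import List, Tuple


def summarize_runs(success_rates: List[float]) -> Tuple[float, float, float]:
    """Return (mean, sample standard deviation, standard error) over runs."""
    if len(success_rates) < 2:
        raise ValueError("need at least two runs to estimate spread")
    mean = statistics.mean(success_rates)
    std = statistics.stdev(success_rates)  # sample standard deviation (n - 1)
    sem = std / math.sqrt(len(success_rates))
    return mean, std, sem


# Placeholder success rates from three hypothetical runs.
mean, std, sem = summarize_runs([0.5, 0.75, 1.0])
print(f"{mean:.2f} +/- {std:.2f} (SEM {sem:.3f})")  # prints "0.75 +/- 0.25 (SEM 0.144)"
```

Stating which spread is reported (standard deviation or standard error) and the number of runs lets readers compare methods whose intervals overlap.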

Circularity Check

0 steps flagged

No significant circularity detected

The paper presents an empirical architecture and training procedure for PaLM-E, with all central claims (task performance, cross-modal transfer, embodiment generalization) resting on reported experimental results from new end-to-end training and evaluation rather than any closed-form derivation or self-referential definition. The pre-trained PaLM component is invoked as an external starting point whose parameters are not redefined or fitted inside the present work; no equation, prediction, or uniqueness claim reduces by construction to quantities already present in the inputs or prior self-citations. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The paper relies on standard deep learning assumptions and the pre-trained PaLM model. No new free parameters are explicitly introduced beyond standard training hyperparameters; no invented entities are postulated.

axioms (1)

domain assumption End-to-end training on interleaved multimodal inputs will establish effective grounding between language and percepts
Invoked in the proposal of embodied language models and the training procedure described in the abstract.

pith-pipeline@v0.9.0 · 5600 in / 1197 out tokens · 52510 ms · 2026-05-10T22:25:05.231350+00:00 · methodology

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

From Prompt to Physical Actuation: Holistic Threat Modeling of LLM-Enabled Robotic Systems
cs.CR 2026-04 unverdicted novelty 8.0

A unified threat model for LLM-enabled robots reveals three cross-boundary attack chains from user input to unsafe physical actuation due to missing validations and unmediated crossings.
MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans?
cs.CV 2024-08 conditional novelty 8.0

MME-RealWorld is the largest manually annotated high-resolution benchmark for MLLMs, where even the best models achieve less than 60% accuracy on challenging real-world tasks.
Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution
cs.CL 2023-09 unverdicted novelty 8.0

Promptbreeder evolves both task prompts and the mutation prompts that improve them using LLMs, outperforming Chain-of-Thought and Plan-and-Solve on arithmetic and commonsense reasoning benchmarks.
VGenST-Bench: A Benchmark for Spatio-Temporal Reasoning via Active Video Synthesis
cs.CV 2026-05 unverdicted novelty 7.0

VGenST-Bench is a new video benchmark for MLLM spatio-temporal reasoning built via generative synthesis, a multi-agent pipeline with human oversight, a 3x2x2 taxonomy, and hierarchical tasks separating perception from...
Aero-World: Action-Conditioned Aerial Video Generation from Inertial Controls
cs.CV 2026-05 unverdicted novelty 7.0

Aero-World adapts a pretrained latent diffusion transformer for action-conditioned aerial video generation by injecting inertial action tokens and using a frozen latent-space Physics Probe for inertial consistency sup...
PRISM: Planning and Reasoning with Intent in Simulated Embodied Environments
cs.RO 2026-05 unverdicted novelty 7.0

PRISM is a tiered benchmark with 300 human-verified tasks across five photorealistic apartments that diagnoses embodied agent failures in basic ability, reasoning ability, and long-horizon ability using an agent-agnostic API.
ECHO: Continuous Hierarchical Memory for Vision-Language-Action Models
cs.RO 2026-05 unverdicted novelty 7.0

ECHO organizes VLA experiences into a hierarchical memory tree in hyperbolic space via autoencoder and entailment constraints, delivering a 12.8% success-rate gain on LIBERO-Long over the pi0 baseline.
KinDER: A Physical Reasoning Benchmark for Robot Learning and Planning
cs.RO 2026-04 unverdicted novelty 7.0

KinDER is a new open-source benchmark that demonstrates substantial gaps in current robot learning and planning methods for handling physical constraints.
AeroBridge-TTA: Test-Time Adaptive Language-Conditioned Control for UAVs
cs.RO 2026-04 unverdicted novelty 7.0

AeroBridge-TTA achieves +22 pt average gains on out-of-distribution UAV dynamics mismatches by updating a latent state online from observed transitions in a language-conditioned policy.
Using large language models for embodied planning introduces systematic safety risks
cs.AI 2026-04 unverdicted novelty 7.0

LLM planners for robots often produce dangerous plans even when planning succeeds, with safety awareness staying flat as model scale improves planning ability.
Watching Movies Like a Human: Egocentric Emotion Understanding for Embodied Companions
cs.CV 2026-04 conditional novelty 7.0

Creates the first egocentric screen-view movie emotion benchmark and demonstrates that cinematic models drop sharply in Macro-F1 on realistic robot-like viewing conditions while domain-specific training improves robustness.
Mosaic: Cross-Modal Clustering for Efficient Video Understanding
cs.PF 2026-04 unverdicted novelty 7.0

Mosaic uses cross-modal clusters as the unit for KVCache organization in VLMs to achieve up to 1.38x speedup in streaming long-video understanding.
How Far Are Large Multimodal Models from Human-Level Spatial Action? A Benchmark for Goal-Oriented Embodied Navigation in Urban Airspace
cs.AI 2026-04 unverdicted novelty 7.0

Large multimodal models display emerging but limited spatial action capabilities in goal-oriented urban 3D navigation, remaining far from human-level performance with errors diverging rapidly after critical decision points.
KITE: Keyframe-Indexed Tokenized Evidence for VLM-Based Robot Failure Analysis
cs.RO 2026-04 unverdicted novelty 7.0

KITE is a training-free method that uses keyframe-indexed tokenized evidence including BEV schematics to enhance VLM performance on robot failure detection, identification, localization, explanation, and correction.
LongTail Driving Scenarios with Reasoning Traces: The KITScenes LongTail Dataset
cs.CV 2026-03 unverdicted novelty 7.0

KITScenes LongTail supplies multimodal driving data and multilingual expert reasoning traces to benchmark models on rare scenarios beyond basic safety metrics.
AR-VLA: True Autoregressive Action Expert for Vision-Language-Action Models
cs.RO 2026-03 unverdicted novelty 7.0

AR-VLA introduces a standalone autoregressive action expert with long-lived memory that generates context-aware continuous actions for VLAs, replacing chunk-based heads with smoother trajectories and maintained task success.
Mema: Memory-Augmented Adapter for Enhanced Vision-Language Understanding
cs.CV 2026-02 unverdicted novelty 7.0

Mema adds a stateful memory module to vision encoders that accumulates hierarchical visual features across layers and selectively injects portions back via feedback to preserve fine-grained cues, yielding consistent g...
PhysMem: Scaling Test-Time Memory for Embodied Physical Reasoning
cs.RO 2026-02 unverdicted novelty 7.0

PhysMem enables VLM-based robot planners to learn and verify physical properties through test-time interaction and hypothesis testing, raising success on a brick insertion task from 23% to 76%.
UniLACT: Depth-Aware RGB Latent Action Learning for Vision-Language-Action Models
cs.RO 2026-02 unverdicted novelty 7.0

UniLACT improves VLA models by adding depth-aware unified latent action pretraining that outperforms RGB-only baselines on seen and unseen manipulation tasks.
ST-BiBench: Benchmarking Multi-Stream Multimodal Coordination in Bimanual Embodied Tasks for MLLMs
cs.RO 2026-02 unverdicted novelty 7.0

ST-BiBench reveals a coordination paradox in which MLLMs show strong high-level strategic reasoning yet fail at fine-grained 16-dimensional bimanual action synthesis and multi-stream fusion.
Large Video Planner Enables Generalizable Robot Control
cs.RO 2025-12 conditional novelty 7.0

A video foundation model trained on human demonstrations generates zero-shot plans that convert to executable robot actions on novel scenes and tasks.
From Standalone LLMs to Integrated Intelligence: A Survey of Compound AI Systems
cs.MA 2025-06 accept novelty 7.0

A survey that defines Compound AI Systems, proposes a multi-dimensional taxonomy based on component roles and orchestration strategies, reviews four foundational paradigms, and identifies key challenges for future research.
WorldSense: Evaluating Real-world Omnimodal Understanding for Multimodal LLMs
cs.CV 2025-02 unverdicted novelty 7.0

WorldSense provides the first benchmark requiring synergistic audio-video-text understanding on 1,662 real-world videos and 3,172 QA pairs, where the best current multimodal LLM reaches only 65.1% accuracy.
Modality-Inconsistent Continual Learning of Multimodal Large Language Models
cs.LG 2024-12 unverdicted novelty 7.0

The paper introduces the MICL scenario for MLLMs with modality and task shifts and proposes MoInCL using pseudo-target generation and instruction-based distillation, reporting gains over continual learning baselines o...
3D-VLA: A 3D Vision-Language-Action Generative World Model
cs.CV 2024-03 unverdicted novelty 7.0

3D-VLA is a new embodied foundation model that uses a 3D LLM plus aligned diffusion models to generate future images and point clouds for improved reasoning and action planning in 3D environments.
RT-H: Action Hierarchies Using Language
cs.RO 2024-03 conditional novelty 7.0

RT-H learns robot policies by first predicting language motions as an intermediate representation and then mapping those plus the high-level task to actions, yielding more robust multi-task performance and the ability...
Zero-Shot Robotic Manipulation with Pretrained Image-Editing Diffusion Models
cs.RO 2023-10 conditional novelty 7.0

SuSIE uses a finetuned InstructPix2Pix diffusion model to propose subgoal images that guide a low-level goal-conditioned policy, achieving SOTA zero-shot performance on CALVIN and real-world manipulation.
Learning Interactive Real-World Simulators
cs.AI 2023-10 conditional novelty 7.0

UniSim learns a universal real-world simulator from orchestrated diverse datasets, enabling zero-shot deployment of policies trained purely in simulation.
Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation
cs.CV 2023-10 unverdicted novelty 7.0

A new shared video-image tokenizer enables large language models to surpass diffusion models on standard visual generation benchmarks.
VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models
cs.RO 2023-07 unverdicted novelty 7.0

VoxPoser uses LLMs to compose 3D value maps via VLM interaction for model-based synthesis of robust robot trajectories on open-set language-specified manipulation tasks.
Voyager: An Open-Ended Embodied Agent with Large Language Models
cs.AI 2023-05 unverdicted novelty 7.0

Voyager achieves superior lifelong learning in Minecraft by combining an automatic exploration curriculum, a library of executable skills, and iterative LLM prompting with environment feedback, yielding 3.3x more uniq...
LLM+P: Empowering Large Language Models with Optimal Planning Proficiency
cs.AI 2023-04 accept novelty 7.0

LLM+P lets LLMs solve planning problems optimally by converting them to PDDL for classical planners and back to natural language.
Visual Instruction Tuning
cs.CV 2023-04 unverdicted novelty 7.0

LLaVA is trained on GPT-4 generated visual instruction data to achieve 85.1% relative performance to GPT-4 on synthetic multimodal tasks and 92.53% accuracy on Science QA.
How to Instruct Your Robot: Dense Language Annotations Power Robot Policy Learning
cs.RO 2026-05 unverdicted novelty 6.0

DeMiAn re-annotates robot and egocentric videos with VLM-generated dense labels across motion, scene, pose, and reasoning aspects, then uses a learned instructor to boost policy success by 5 points on RoboCasa over ta...
From Failure to Feedback: Group Revision Unlocks Hard Cases in Object-Level Grounding
cs.CV 2026-05 unverdicted novelty 6.0

A group-revision paradigm for GRPO-based RL fine-tuning of VLMs converts failure responses into improvement signals that refine rewards and advantages, yielding gains on referring segmentation, REC, and counting benchmarks.
SpatialForge: Bootstrapping 3D-Aware Spatial Reasoning from Open-World 2D Images
cs.CV 2026-05 unverdicted novelty 6.0

SpatialForge synthesizes 10 million spatial QA pairs from in-the-wild 2D images to train VLMs for better depth ordering, layout, and viewpoint-dependent reasoning.
How Far Are VLMs from Privacy Awareness in the Physical World? An Empirical Study
cs.CR 2026-05 unverdicted novelty 6.0

VLMs show consistent deficits in identifying sensitive items in cluttered scenes, adapting to social contexts, and resolving conflicts between commands and privacy constraints in a new physical simulator benchmark.
How Far Are VLMs from Privacy Awareness in the Physical World? An Empirical Study
cs.CR 2026-05 unverdicted novelty 6.0

Vision-language models exhibit perceptual fragility and fail to consistently respect privacy constraints when operating in simulated physical environments, with performance declining in cluttered scenes and under conf...
Affordance Agent Harness: Verification-Gated Skill Orchestration
cs.RO 2026-05 unverdicted novelty 6.0

Affordance Agent Harness is a verification-gated orchestration system that unifies skills via an evidence store, episodic memory priors, an adaptive router, and a self-consistency verifier to improve accuracy-cost tra...
AGoQ: Activation and Gradient Quantization for Memory-Efficient Distributed Training of LLMs
cs.CL 2026-05 unverdicted novelty 6.0

AGoQ delivers up to 52% lower memory use and 1.34x faster training for 8B-32B LLaMA models by using near-4-bit adaptive activations and 8-bit gradients while preserving pretraining convergence and downstream accuracy.
Physically Native World Models: A Hamiltonian Perspective on Generative World Modeling
cs.AI 2026-05 unverdicted novelty 6.0

Hamiltonian World Models structure latent dynamics around energy-conserving Hamiltonian evolution to produce physically grounded, action-controllable predictions for embodied decision making.
Odysseus: Scaling VLMs to 100+ Turn Decision-Making in Games via Reinforcement Learning
cs.LG 2026-05 unverdicted novelty 6.0

Odysseus adapts PPO with a turn-level critic and leverages pretrained VLM action priors to train agents achieving at least 3x average game progress over frontier models in long-horizon Super Mario Land.
GoClick: Lightweight Element Grounding Model for Autonomous GUI Interaction
cs.CV 2026-04 unverdicted novelty 6.0

GoClick is a compact 230M-parameter encoder-decoder VLM for GUI element grounding that matches larger models' accuracy via a Progressive Data Refinement pipeline yielding a 3.8M-sample core set.
An LLM-Driven Closed-Loop Autonomous Learning Framework for Robots Facing Uncovered Tasks in Open Environments
cs.RO 2026-04 unverdicted novelty 6.0

Robots autonomously convert LLM-guided experiences into a reusable local method library, reducing average execution time from 7.7772s to 6.7779s and LLM calls per task from 1.0 to 0.2 in repeated-task experiments.
Navigating the Clutter: Waypoint-Based Bi-Level Planning for Multi-Robot Systems
cs.RO 2026-04 unverdicted novelty 6.0

Waypoint-based bi-level planning with curriculum RLVR improves multi-robot task success rates in dense-obstacle benchmarks over motion-agnostic and VLA baselines.
EmbodiedGovBench: A Benchmark for Governance, Recovery, and Upgrade Safety in Embodied Agent Systems
cs.RO 2026-04 unverdicted novelty 6.0

EmbodiedGovBench is a new benchmark framework that measures embodied agent systems on seven governance dimensions including policy adherence, recovery success, and upgrade safety.
Q-Zoom: Query-Aware Adaptive Perception for Efficient Multimodal Large Language Models
cs.CV 2026-04 unverdicted novelty 6.0

Q-Zoom achieves up to 4.39x inference speedup in high-resolution MLLM scenarios via query-aware gating and region localization, matching or exceeding baseline accuracy on document and high-res benchmarks.
A Physical Agentic Loop for Language-Guided Grasping with Execution-State Monitoring
cs.RO 2026-04 unverdicted novelty 6.0

A physical agentic loop with execution-state monitoring improves robustness of language-guided grasping over open-loop execution by converting noisy telemetry into discrete outcome events that trigger retries or user ...
CoME-VL: Scaling Complementary Multi-Encoder Vision-Language Learning
cs.CV 2026-04 unverdicted novelty 6.0

CoME-VL fuses contrastive and self-supervised vision encoders via entropy-guided multi-layer aggregation and RoPE cross-attention to improve vision-language model performance on benchmarks.
World Action Models are Zero-shot Policies
cs.RO 2026-02 unverdicted novelty 6.0

DreamZero uses a 14B video diffusion model as a World Action Model to achieve over 2x better zero-shot generalization on real robots than state-of-the-art VLAs, real-time 7Hz closed-loop control, and cross-embodiment ...
Robust Policy Optimization to Prevent Catastrophic Forgetting
cs.LG 2026-02 unverdicted novelty 6.0

FRPO applies a max-min robust optimization over KL-bounded policy neighborhoods during RLHF to reduce catastrophic forgetting of safety and accuracy under subsequent SFT or RL fine-tuning.
Real2Sim via Active Perception with Behavior Trees Automatically Generated by VLMs
cs.RO 2026-01 unverdicted novelty 6.0

An intent-driven Real2Sim framework uses VLMs for semantic task decomposition to identify missing physical parameters and generates reactive behavior trees to acquire them via contact-rich robotic interactions on a Fr...
PALM: Progress-Aware Policy Learning via Affordance Reasoning for Long-Horizon Robotic Manipulation
cs.RO 2026-01 unverdicted novelty 6.0

PALM improves long-horizon robotic manipulation success by distilling affordance representations for object interaction and predicting within-subtask progress in a VLA model.
SPEAR-1: Scaling Beyond Robot Demonstrations via 3D Understanding
cs.RO 2025-11 unverdicted novelty 6.0

SPEAR-1 combines a 3D-enriched VLM with embodied control to match or exceed existing robotic foundation models using 20 times fewer robot demonstrations.
AsyncVLA: Asynchronous Flow Matching for Vision-Language-Action Models
cs.RO 2025-11 unverdicted novelty 6.0

AsyncVLA adds asynchronous flow matching and a confidence rater to VLA models so they can generate actions on flexible schedules and selectively refine low-confidence tokens before execution.
Training-Free Multimodal Large Language Model Orchestration
cs.CL 2025-08 unverdicted novelty 6.0

LLM Orchestration integrates modality experts via an LLM controller, cross-modal memory, and interaction layer to enable multimodal input-output without gradient-based training.
DreamVLA: A Vision-Language-Action Model Dreamed with Comprehensive World Knowledge
cs.CV 2025-07 unverdicted novelty 6.0

DreamVLA uses dynamic-region-guided world knowledge prediction, block-wise attention to disentangle information types, and a diffusion transformer for actions, reaching 76.7% success on real robot tasks and 4.44 avera...
Multi-SpatialMLLM: Multi-Frame Spatial Understanding with Multi-Modal Large Language Models
cs.CV 2025-05 unverdicted novelty 6.0

Multi-SpatialMLLM integrates depth perception, visual correspondence, and dynamic perception into MLLMs via a 27M-sample MultiSPA dataset and benchmark, yielding gains on multi-frame spatial tasks.
GraspVLA: a Grasping Foundation Model Pre-trained on Billion-scale Synthetic Action Data
cs.RO 2025-05 unverdicted novelty 6.0

GraspVLA shows that pretraining a grasping model on a billion synthetic action frames enables zero-shot open-vocabulary performance and sim-to-real transfer.
$\pi_{0.5}$: a Vision-Language-Action Model with Open-World Generalization
cs.LG 2025-04 unverdicted novelty 6.0

π_{0.5} is a VLA model that achieves long-horizon dexterous manipulation in entirely new homes through co-training on heterogeneous tasks and multi-source data including web and semantic predictions.

Reference graph

Works this paper leans on

43 extracted references · 43 canonical work pages · cited by 139 Pith papers · 18 internal anchors

[1]

Do As I Can, Not As I Say: Grounding Language in Robotic Affordances

Ahn, M., Brohan, A., Brown, N., Chebotar, Y., Cortes, O., David, B., Finn, C., Gopalakrishnan, K., Hausman, K., Herzog, A., et al. Do as I can, not as I say: Grounding language in robotic affordances. arXiv preprint arXiv:2204.01691.
[2]

Flamingo: a Visual Language Model for Few-Shot Learning

Alayrac, J.-B., Donahue, J., Luc, P., Miech, A., Barr, I., Hasson, Y., Lenc, K., Mensch, A., Millican, K., Reynolds, M., et al. Flamingo: a visual language model for few-shot learning. arXiv preprint arXiv:2204.14198.
[3]

On the Opportunities and Risks of Foundation Models

Bommasani, R., Hudson, D. A., Adeli, E., Altman, R., Arora, S., von Arx, S., Bernstein, M. S., Bohg, J., Bosselut, A., Brunskill, E., et al. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258.
[4]

RT-1: Robotics Transformer for Real-World Control at Scale

Brohan, A., Brown, N., Carbajal, J., Chebotar, Y., Dabis, J., Finn, C., Gopalakrishnan, K., Hausman, K., Herzog, A., Hsu, J., et al. RT-1: Robotics transformer for real-world control at scale. arXiv preprint arXiv:2212.06817.
[5]

Language Models are Few-Shot Learners

Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901.
[6]

Evaluating Large Language Models Trained on Code

URL https://arxiv.org/abs/2205.01883.

Chen, M., Tworek, J., Jun, H., Yuan, Q., Pinto, H. P. d. O., Kaplan, J., Edwards, H., Burda, Y., Joseph, N., Brockman, G., et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021a.

Chen, T., Saxena, S., Li, L., Fleet, D. J., and Hinton, G. Pix2seq: A language modeling framework...
[7]

PaLI: A Jointly-Scaled Multilingual Language-Image Model

Chen, X., Wang, X., Changpinyo, S., Piergiovanni, A., Padlewski, P., Salz, D., Goodman, S., Grycner, A., Mustafa, B., Beyer, L., et al. PaLI: A jointly-scaled multilingual language-image model. arXiv preprint arXiv:2209.06794.
[8]

PaLM: Scaling Language Modeling with Pathways

Chowdhery, A., Narang, S., Devlin, J., Bosma, M., Mishra, G., Roberts, A., Barham, P., Chung, H. W., Sutton, C., Gehrmann, S., et al. PaLM: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311.
[9]

Scaling Vision Transformers to 22 Billion Parameters

Dehghani, M., Djolonga, J., Mustafa, B., Padlewski, P., Heek, J., Gilmer, J., Steiner, A., Caron, M., Geirhos, R., Alabdulmohsin, I., et al. Scaling vision transformers to 22 billion parameters. arXiv preprint arXiv:2302.05442.
[10]

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
[11]

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929.
[12]

Improving alignment of dialogue agents via targeted human judgements

Glaese, A., McAleese, N., Trebacz, M., Aslanides, J., Firoiu, V., Ewalds, T., Rauh, M., Weidinger, L., Chadwick, M., Thacker, P., et al. Improving alignment of dialogue agents via targeted human judgements. arXiv preprint arXiv:2209.14375.
[13]

Instruction-driven history-aware policies for robotic manipulations

Guhur, P.-L., Chen, S., Garcia, R., Tapaswi, M., Laptev, I., and Schmid, C. Instruction-driven history-aware policies for robotic manipulations. arXiv preprint arXiv:2209.04899.
[14]

Language Models are General-Purpose Interfaces

Hao, Y., Song, H., Dong, L., Huang, S., Chi, Z., Wang, W., Ma, S., and Wei, F. Language models are general-purpose interfaces. arXiv preprint arXiv:2206.06336.
[15]

Visual language maps for robot navigation

Huang, C., Mees, O., Zeng, A., and Burgard, W. Visual language maps for robot navigation. arXiv preprint arXiv:2210.05714, 2022a.

Huang, W., Abbeel, P., Pathak, D., and Mordatch, I. Language models as zero-shot planners: Extracting actionable knowledge for embodied agents. arXiv preprint arXiv:2201.07207, 2022b.

Huang, W., Xia, F., Xiao, T., Chan, H...
[16]

VIMA: General Robot Manipulation with Multimodal Prompts

Jiang, Y., Gupta, A., Zhang, Z., Wang, G., Dou, Y., Chen, Y., Fei-Fei, L., Anandkumar, A., Zhu, Y., and Fan, L. VIMA: General robot manipulation with multimodal prompts. arXiv preprint arXiv:2210.03094.
[17]

Large Language Models are Zero-Shot Reasoners

Kojima, T., Gu, S. S., Reid, M., Matsuo, Y., and Iwasawa, Y. Large language models are zero-shot reasoners. arXiv preprint arXiv:2205.11916.
[18]

The Power of Scale for Parameter-Efficient Prompt Tuning

Lester, B., Al-Rfou, R., and Constant, N. The power of scale for parameter-efficient prompt tuning. arXiv preprint arXiv:2104.08691.
[19]

Solving Quantitative Reasoning Problems with Language Models

Lewkowycz, A., Andreassen, A., Dohan, D., Dyer, E., Michalewski, H., Ramasesh, V., Slone, A., Anil, C., Schlag, I., Gutman-Solo, T., et al. Solving quantitative reasoning problems with language models. arXiv preprint arXiv:2206.14858.
[20]

VisualBERT: A Simple and Performant Baseline for Vision and Language

Li, L. H., Yatskar, M., Yin, D., Hsieh, C.-J., and Chang, K.-W. VisualBERT: A simple and performant baseline for vision and language. arXiv preprint arXiv:1908.03557.
[21]

TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models

Li, M., Lv, T., Chen, J., Cui, L., Lu, Y., Florencio, D., Zhang, C., Li, Z., and Wei, F. TrOCR: Transformer-based optical character recognition with pre-trained models. arXiv preprint arXiv:2109.10282.
[22]

Pre-Trained Language Models for Interactive Decision-Making

Li, S., Puig, X., Du, Y., Wang, C., Akyurek, E., Torralba, A., Andreas, J., and Mordatch, I. Pre-trained language models for interactive decision-making. arXiv preprint arXiv:2202.01771.
[23]

Code as Policies: Language Model Programs for Embodied Control

Liang, J., Huang, W., Xia, F., Xu, P., Hausman, K., Ichter, B., Florence, P., and Zeng, A. Code as policies: Language model programs for embodied control. arXiv preprint arXiv:2209.07753.
[24]

Pretrained Transformers as universal computation engines

Lu, K., Grover, A., Abbeel, P., and Mordatch, I. Pretrained transformers as universal computation engines. arXiv preprint arXiv:2103.05247, 1.
[25]

Language conditioned imitation learning over unstructured data

Lynch, C. and Sermanet, P. Language conditioned imitation learning over unstructured data. arXiv preprint arXiv:2005.07648.
[26]

Interactive language: Talking to robots in real time

Lynch, C., Wahid, A., Tompson, J., Ding, T., Betker, J., Baruch, R., Armstrong, T., and Florence, P. Interactive language: Talking to robots in real time. arXiv preprint arXiv:2210.06407.
[27]

Do embodied agents dream of pixelated sheep?: Embodied decision making using language guided world modelling

Nottingham, K., Ammanabrolu, P., Suhr, A., Choi, Y., Hajishirzi, H., Singh, S., and Fox, R. Do embodied agents dream of pixelated sheep?: Embodied decision making using language guided world modelling. arXiv preprint arXiv:2301.12050.
[28]

URL https://arxiv.org/abs/2209.04372.

Polu, S., Han, J. M., Zheng, K., Baksys, M., Babuschkin, I., and Sutskever, I. Formal mathematics statement curriculum learning. arXiv preprint arXiv:2202.01344.
[29]

A Generalist Agent

Reed, S., Zolna, K., Parisotto, E., Colmenarejo, S. G., Novikov, A., Barth-Maron, G., Gimenez, M., Sulsky, Y., Kay, J., Springenberg, J. T., et al. A generalist agent. arXiv preprint arXiv:2205.06175.
[30]

TokenLearner: What Can 8 Learned Tokens Do for Images and Videos?

Ryoo, M. S., Piergiovanni, A., Arnab, A., Dehghani, M., and Angelova, A. TokenLearner: What can 8 learned tokens do for images and videos? arXiv preprint arXiv:2106.11297, 2021.
[31]

Sajjadi, M. S. M., Duckworth, D., Mahendran, A., van Steenkiste, S., Pavetić, F., Lučić, M., Guibas, L. J., Greff, K., and Kipf, T. Object Scene Representation Transformer. NeurIPS, 2022a. URL https://osrt-paper.github.io/.

Sajjadi, M. S. M., Meyer, H., Pot, E., Bergmann, U., Greff, K., Radwan, N., Vora, S., Lučić, M., Duckworth, D., Dosovitsk...
[32]

Skill Induction and Planning with Latent Language

Sharma, P., Torralba, A., and Andreas, J. Skill induction and planning with latent language. arXiv preprint arXiv:2110.01517.
[33]

Perceiver-actor: A multi-task transformer for robotic manipulation

Shridhar, M., Manuelli, L., and Fox, D. CLIPort: What and where pathways for robotic manipulation. In Conference on Robot Learning, pp. 894–906. PMLR, 2022a.

Shridhar, M., Manuelli, L., and Fox, D. Perceiver-actor: A multi-task transformer for robotic manipulation. arXiv preprint arXiv:2209.05451, 2022b.

Silva, A., Moorman, N., Silva, W., Zaidi, Z., Gopal...
[34]

ProgPrompt: Generating Situated Robot Task Plans using Large Language Models

Singh, I., Blukis, V., Mousavian, A., Goyal, A., Xu, D., Tremblay, J., Fox, D., Thomason, J., and Garg, A. ProgPrompt: Generating situated robot task plans using large language models. arXiv preprint arXiv:2209.11302.
[35]

LaMDA: Language Models for Dialog Applications

Thoppilan, R., De Freitas, D., Hall, J., Shazeer, N., Kulshreshtha, A., Cheng, H.-T., Jin, A., Bos, T., Baker, L., Du, Y., et al. LaMDA: Language models for dialog applications. arXiv preprint arXiv:2201.08239.
[36]

Describe, Explain, Plan and Select: Interactive Planning with Large Language Models Enables Open-World Multi-Task Agents

Wang, Z., Cai, S., Liu, A., Ma, X., and Liang, Y. Describe, explain, plan and select: Interactive planning with large language models enables open-world multi-task agents. arXiv preprint arXiv:2302.01560.
[37]

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., and Zhou, D. Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903.
[38]

Robotic skill acquisition via instruction augmentation with vision-language models

Xiao, T., Chan, H., Sermanet, P., Wahid, A., Brohan, A., Hausman, K., Levine, S., and Tompson, J. Robotic skill acquisition via instruction augmentation with vision-language models. arXiv preprint arXiv:2211.11736.
[39]

PIGLeT: Language Grounding Through Neuro-Symbolic Interaction in a 3D World

Zellers, R., Holtzman, A., Peters, M., Mottaghi, R., Kembhavi, A., Farhadi, A., and Choi, Y. PIGLeT: Language grounding through neuro-symbolic interaction in a 3D world. arXiv preprint arXiv:2106.00188, 2021a.

Zellers, R., Lu, X., Hessel, J., Yu, Y., Park, J. S., Cao, J., Farhadi, A., and Choi, Y. MERLOT: Multimodal neural script knowledge models. Ad...
[40]

Hierarchical task learning from language instructions with unified transformers and self-monitoring

Zhang, Y. and Chai, J. Hierarchical task learning from language instructions with unified transformers and self-monitoring. arXiv preprint arXiv:2106.03427, 2021.
[41]

full mixture

1 0.5
Wikipedia text 1 0.5
(robot) Mobile Manipulator, real 6 3.1
(robot) Language Table (Lynch et al., 2022), sim and real 8 4.2
(robot) TAMP, sim 3 1.6

Table 6: Dataset sampling frequency and ratio for the “full mixture” referred to in experiments.

Figure 8: Two TAMP environment test examples. Left with 6 objects (training data contains 3-5 objects), ri...
[42]

utilizes oracle, one-step affordance functions.

B.2. Interactive Language Table

We use the Language-Table real-world tabletop setup and simulated environment from Interactive Language (Lynch et al., 2022).

Data collection. For each task, given the long horizon instruction, we prompt a labeler to enter a short horizon command every 4 seconds. We pass the s...
[43]

0.60 0.67 0.63
PaLM-E-12B trained on; from scratch; LLM+ViT pretrain; LLM frozen
Single robot n/a 0.67 0.35 0.46
Single robot 0.90 0.69 0.78
Full mixture 0.95 0.80 0.87
Full mixture 0.92 0.88 0.91

Table 10: Mobile manipulation environment: affordance prediction, showing individual precision and recall scores.

E. Image Attribution

The image of the New York Kn...
