pith · machine review for the scientific record

arxiv: 2505.06111 · v3 · submitted 2025-05-09 · 💻 cs.RO · cs.AI · cs.LG

Recognition: 3 theorem links · Lean Theorem

UniVLA: Learning to Act Anywhere with Task-centric Latent Actions

Guanghui Ren, Hongyang Li, Jisong Cai, Maoqing Yao, Ping Luo, Qingwen Bu, Shenyuan Gao, Yanting Yang

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 15:22 UTC · model grok-4.3

classification 💻 cs.RO · cs.AI · cs.LG
keywords vision-language-action · latent actions · cross-embodiment · robot learning · generalist policy · DINO features · video pretraining · manipulation

The pith

UniVLA derives task-centric latent actions from unlabeled videos to build cross-embodiment robot policies that outperform prior methods with far less data and compute.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Existing vision-language-action models depend on large volumes of action-annotated data tied to one robot body, which restricts scaling and generalization. UniVLA instead trains a latent action model on broad internet videos to isolate only the actions relevant to a given language instruction. It performs this extraction inside DINO visual features so that task-irrelevant motion is suppressed. The resulting representations let a single policy be decoded for different physical robots, yielding stronger benchmark results while using less than one-twentieth the pretraining compute and one-tenth the downstream data of OpenVLA. Performance continues to rise as more heterogeneous videos, including human demonstrations, are added to training.

Core claim

A language-conditioned latent action model trained inside DINO feature space on internet-scale videos produces task-centric action representations that transfer across embodiments. These representations are decoded at deployment time to drive specific robots, delivering state-of-the-art results on manipulation and navigation benchmarks as well as real-robot tests. The same pipeline improves steadily when additional heterogeneous data sources are included.

What carries the argument

Language-conditioned latent action model inside DINO feature space that extracts task-centric representations from videos while suppressing irrelevant dynamics.
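
To make the mechanism concrete, here is a minimal sketch of one way such a model could be built, assuming a VQ-VAE-style discrete bottleneck, an inverse-dynamics encoder, and a forward decoder over frozen DINO features. The paper does not publish this exact architecture; every module name, dimension, and loss weight below is an illustrative guess, not the authors' method.

```python
# Hedged sketch of a latent action model (LAM) over frozen DINO features.
# Assumptions (not confirmed by the paper): vector-quantized latent actions,
# an inverse-dynamics encoder, and a forward decoder that reconstructs the
# next frame's DINO features conditioned on the language instruction.
import torch
import torch.nn as nn

class LatentActionModel(nn.Module):
    def __init__(self, feat_dim=768, lang_dim=512, codebook_size=16, code_dim=128):
        super().__init__()
        # Inverse dynamics: infer a latent action from consecutive DINO
        # features plus the instruction (the conditioning is what is meant
        # to suppress task-irrelevant motion such as camera shake).
        self.encoder = nn.Sequential(
            nn.Linear(2 * feat_dim + lang_dim, 512), nn.GELU(),
            nn.Linear(512, code_dim),
        )
        # Small discrete codebook of latent actions (VQ-VAE style, assumed).
        self.codebook = nn.Embedding(codebook_size, code_dim)
        # Forward model: predict next-step DINO features from the current
        # features, the instruction, and the quantized latent action.
        self.decoder = nn.Sequential(
            nn.Linear(feat_dim + lang_dim + code_dim, 512), nn.GELU(),
            nn.Linear(512, feat_dim),
        )

    def forward(self, f_t, f_next, lang):
        z_e = self.encoder(torch.cat([f_t, f_next, lang], dim=-1))
        # Nearest-neighbor quantization with a straight-through estimator.
        codes = torch.cdist(z_e, self.codebook.weight).argmin(dim=-1)
        z_q = self.codebook(codes)
        z_st = z_e + (z_q - z_e).detach()
        f_pred = self.decoder(torch.cat([f_t, lang, z_st], dim=-1))
        recon = (f_pred - f_next).pow(2).mean()      # feature reconstruction
        commit = (z_e - z_q.detach()).pow(2).mean()  # commitment term
        embed = (z_q - z_e.detach()).pow(2).mean()   # codebook update term
        return recon + 0.25 * commit + embed
```

A downstream generalist policy would then be trained to predict these discrete latents from observations and instructions, with a light decoder head mapping them into each robot's native action space at deployment.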

Load-bearing premise

The latent action model trained on internet videos will consistently separate task-relevant dynamics from embodiment-specific or irrelevant motion and those separations will transfer to new robots.

What would settle it

Measure whether a policy trained only on the video corpus matches or exceeds OpenVLA performance when deployed on a previously unseen robot embodiment without any additional action labels for that body.
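
Rendered as a hypothetical harness (none of these interfaces come from the paper; `rollout`, both policy objects, and the environment factory are placeholders):

```python
# Hypothetical settling experiment: deploy a policy pretrained only on
# unlabeled video to an unseen embodiment with no action labels for that
# body, then compare success rates against an OpenVLA-style baseline.
# All interfaces (policy.act, env.reset/step) are assumed, not real APIs.
import random

def rollout(policy, env, max_steps=200):
    obs = env.reset()
    for _ in range(max_steps):
        obs, done, success = env.step(policy.act(obs))
        if done:
            return success
    return False

def success_rate(policy, env_factory, n_trials=50, seed=0):
    rng = random.Random(seed)
    wins = sum(
        rollout(policy, env_factory(seed=rng.randrange(10**6)))
        for _ in range(n_trials)
    )
    return wins / n_trials

def premise_settled(video_only_policy, baseline_policy, env_factory):
    sr_video = success_rate(video_only_policy, env_factory)
    sr_base = success_rate(baseline_policy, env_factory)
    # The load-bearing premise survives only if the video-only policy
    # matches or beats the action-supervised baseline on this body.
    return sr_video >= sr_base, (sr_video, sr_base)
```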

Original abstract

A generalist robot should perform effectively across various environments. However, most existing approaches heavily rely on scaling action-annotated data to enhance their capabilities. Consequently, they are often limited to single physical specification and struggle to learn transferable knowledge across different embodiments and environments. To confront these limitations, we propose UniVLA, a new framework for learning cross-embodiment vision-language-action (VLA) policies. Our key innovation is to derive task-centric action representations from videos with a latent action model. This enables us to exploit extensive data across a wide spectrum of embodiments and perspectives. To mitigate the effect of task-irrelevant dynamics, we incorporate language instructions and establish a latent action model within the DINO feature space. Learned from internet-scale videos, the generalist policy can be deployed to various robots through efficient latent action decoding. We obtain state-of-the-art results across multiple manipulation and navigation benchmarks, as well as real-robot deployments. UniVLA achieves superior performance over OpenVLA with less than 1/20 of pretraining compute and 1/10 of downstream data. Continuous performance improvements are observed as heterogeneous data, even including human videos, are incorporated into the training pipeline. The results underscore UniVLA's potential to facilitate scalable and efficient robot policy learning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes UniVLA, a cross-embodiment VLA policy framework that derives task-centric latent actions from internet-scale videos via a language-conditioned latent action model operating in DINO feature space. This representation is intended to suppress task-irrelevant dynamics, enabling pretraining on heterogeneous data (including human videos) with far less compute and downstream data than prior methods like OpenVLA, while achieving SOTA results on manipulation/navigation benchmarks and real-robot deployments, with performance scaling as more diverse video data is added.

Significance. If the central claims hold, the work offers a promising path toward scalable robot learning by leveraging abundant unlabeled video data rather than action-annotated trajectories, potentially reducing embodiment-specific data requirements and enabling broader transfer. The reported efficiency gains and inclusion of human videos are notable strengths that could influence future VLA designs if the task-centric mechanism is validated.

major comments (3)
  1. [§3.2] (Latent Action Model): The model is trained with language conditioning inside DINO features to mitigate task-irrelevant dynamics, but no equation or loss term (e.g., no explicit action reconstruction, contrastive, or invariance objective) is provided that enforces disentanglement of actionable elements from camera motion or background changes. Without this, the claim that the representations are task-centric and transfer across embodiments rests on an unverified assumption, and gains may simply reflect increased data volume. (An illustrative candidate objective is sketched after the minor comments below.)
  2. [§4] (Experiments) and associated tables: The superior performance over OpenVLA with <1/20 pretraining compute and 1/10 downstream data is reported, yet there is no ablation controlling for total data volume and no comparison against a non-latent-action baseline trained on the identical heterogeneous corpus (including human videos). This leaves open whether the efficiency and scaling claims are due to the proposed mechanism or simply more data.
  3. [§4.3] (Real-robot deployments): Continuous improvements are claimed as human videos are added, but the results lack error bars, statistical significance tests, or per-embodiment breakdowns that would confirm transfer rather than embodiment-specific overfitting.
minor comments (2)
  1. [Abstract / §2] The abstract and introduction use 'task-centric latent actions' without an early formal definition or diagram; a concise equation or figure in §2 would improve readability.
  2. [Figures / Tables] Figure captions and table footnotes should explicitly state the number of seeds/runs and whether DINO features are frozen during policy training.
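
For concreteness on major comment 1, here is one candidate form of the missing objective, written as this review's reconstruction rather than an equation from the paper: a language-conditioned forward-prediction loss in frozen DINO feature space, with the latent action forced through a small discrete bottleneck so that only instruction-relevant dynamics can pass.

```latex
% Illustrative only: \phi is a frozen DINO encoder, \ell the instruction,
% E an inverse-dynamics encoder, q(\cdot) quantization onto a small
% codebook \mathcal{C}, and D a forward decoder.
\begin{aligned}
  z_t &= q\!\left( E\big(\phi(o_t),\, \phi(o_{t+1}),\, \ell\big) \right),
      \qquad z_t \in \mathcal{C}, \\
  \mathcal{L}_{\mathrm{LAM}} &=
      \big\lVert D\big(\phi(o_t),\, \ell,\, z_t\big) - \phi(o_{t+1}) \big\rVert_2^2 .
\end{aligned}
```

Whether a bottleneck of this kind suffices to discard camera motion and background change, absent an explicit invariance term, is exactly what the comment presses on.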

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback and positive assessment of the work's potential. We address each major comment below with proposed revisions to improve clarity, experimental rigor, and statistical reporting.

Point-by-point responses
  1. Referee: [§3.2] (Latent Action Model): The model is trained with language conditioning inside DINO features to mitigate task-irrelevant dynamics, but no equation or loss term (e.g., no explicit action reconstruction, contrastive, or invariance objective) is provided that enforces disentanglement of actionable elements from camera motion or background changes. Without this, the claim that the representations are task-centric and transfer across embodiments rests on an unverified assumption, and gains may simply reflect increased data volume.

    Authors: We appreciate the referee highlighting the need for explicit formulation. The latent action model in §3.2 uses language conditioning within DINO features to suppress task-irrelevant elements such as camera motion. To make the mechanism transparent, we will revise §3.2 to include the full mathematical description of the model architecture and the training objectives, which incorporate language-conditioned reconstruction to prioritize actionable components. This addition will directly support the task-centric property and address the concern about unverified assumptions. revision: yes

  2. Referee: [§4] (Experiments) and associated tables: The superior performance over OpenVLA with <1/20 pretraining compute and 1/10 downstream data is reported, yet there is no ablation controlling for total data volume and no comparison against a non-latent-action baseline trained on the identical heterogeneous corpus (including human videos). This leaves open whether the efficiency and scaling claims are due to the proposed mechanism or simply more data.

    Authors: This comment correctly identifies a gap in the experimental controls. Our primary baselines compare against OpenVLA on its published data, and we show performance scaling with added heterogeneous videos. However, training a full non-latent-action VLA baseline on the exact same large-scale heterogeneous corpus (including human videos) would require prohibitive additional compute. In the revision, we will expand §4 with a dedicated discussion of this limitation, clarify why the latent action approach enables heterogeneous data use, and include partial ablations on data volume subsets to better isolate the contribution of the proposed mechanism. revision: partial

  3. Referee: [§4.3] (Real-robot deployments): Continuous improvements are claimed as human videos are added, but the results lack error bars, statistical significance tests, or per-embodiment breakdowns that would confirm transfer rather than embodiment-specific overfitting.

    Authors: We agree that enhanced statistical reporting is necessary for the real-robot results. The deployments in §4.3 were performed across multiple trials on different robot platforms. We will revise this section to report error bars (standard deviation across trials), specify trial counts, provide per-embodiment success rate breakdowns, and include basic statistical significance tests to better demonstrate cross-embodiment transfer. revision: yes
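
A minimal sketch of the statistical reporting promised in response 3, assuming per-trial binary success logs for each embodiment; the robot names and trial outcomes below are placeholders, not results from the paper.

```python
# Hedged sketch: per-embodiment success rates with percentile-bootstrap
# confidence intervals over binary trial outcomes. The trial logs here
# are illustrative placeholders, not data from the paper.
import random

def bootstrap_ci(outcomes, n_boot=10_000, alpha=0.05, seed=0):
    """Point estimate and (1 - alpha) percentile bootstrap CI for a success rate."""
    rng = random.Random(seed)
    n = len(outcomes)
    means = sorted(
        sum(rng.choice(outcomes) for _ in range(n)) / n for _ in range(n_boot)
    )
    lo = means[int(n_boot * alpha / 2)]
    hi = means[int(n_boot * (1 - alpha / 2)) - 1]
    return sum(outcomes) / n, (lo, hi)

# Hypothetical per-embodiment breakdown (placeholder 0/1 trial logs).
trials = {
    "arm_platform_a": [1, 1, 0, 1, 1, 1, 0, 1, 1, 1],
    "arm_platform_b": [1, 0, 1, 1, 0, 1, 1, 0, 1, 1],
}
for robot, outcomes in trials.items():
    rate, (lo, hi) = bootstrap_ci(outcomes)
    print(f"{robot}: success {rate:.0%} (95% CI {lo:.0%}-{hi:.0%}, n={len(outcomes)})")
```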

Circularity Check

0 steps flagged

No significant circularity; derivation relies on external pretraining and heterogeneous data

Full rationale

The paper's central claims rest on training a latent action model in pretrained DINO feature space using internet-scale videos (including human data) conditioned on language, then decoding for robot policies. This chain uses independent external components (DINO features, large video corpora) rather than fitting parameters to the target robot data and renaming them as predictions. No equations reduce performance metrics to self-defined quantities, and no load-bearing self-citations or uniqueness theorems are invoked to force the result. Empirical gains from adding heterogeneous data are presented as observations, not tautological outputs of the model definition itself.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the unproven effectiveness of the latent action model at isolating task-centric information from heterogeneous video sources.

axioms (1)
  • domain assumption: Language instructions can be used to suppress task-irrelevant dynamics when learning latent actions from video.
    Explicitly invoked in the abstract to justify the DINO-space latent action model.
invented entities (1)
  • task-centric latent actions (no independent evidence)
    purpose: Compact representation of actions extracted from video that transfers across robot embodiments
    Core new construct introduced to enable training on unlabeled video instead of action-annotated data

pith-pipeline@v0.9.0 · 5550 in / 1354 out tokens · 56423 ms · 2026-05-12T15:22:32.646114+00:00 · methodology

discussion (0)


Forward citations

Cited by 33 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. RotVLA: Rotational Latent Action for Vision-Language-Action Model

    cs.RO 2026-05 unverdicted novelty 7.0

    RotVLA models latent actions as continuous SO(n) rotations with triplet-frame supervision and flow-matching to reach 98.2% success on LIBERO and 89.6%/88.5% on RoboTwin2.0 using a 1.7B-parameter model.

  2. ALAM: Algebraically Consistent Latent Action Model for Vision-Language-Action Models

    cs.RO 2026-05 unverdicted novelty 7.0

    ALAM creates algebraically consistent latent action transitions from videos to act as auxiliary generative targets, raising robot policy success rates from 47.9% to 85.0% on MetaWorld MT50 and 94.1% to 98.1% on LIBERO.

  3. Learning Visual Feature-Based World Models via Residual Latent Action

    cs.CV 2026-05 unverdicted novelty 7.0

    RLA-WM predicts residual latent actions via flow matching to create visual feature world models that outperform prior feature-based and diffusion approaches while enabling offline video-based robot RL.

  4. VLA-GSE: Boosting Parameter-Efficient Fine-Tuning in VLA with Generalized and Specialized Experts

    cs.RO 2026-05 unverdicted novelty 7.0

    VLA-GSE improves VLA adaptation by initializing generalized shared experts and specialized routed experts via spectral decomposition of the backbone, outperforming full fine-tuning and other PEFT methods on robotic be...

  5. Being-H0.7: A Latent World-Action Model from Egocentric Videos

    cs.RO 2026-04 unverdicted novelty 7.0

    Being-H0.7 adds future-aware latent reasoning to direct VLA policies via dual-branch alignment on latent queries, matching world-model benefits at VLA efficiency.

  6. CF-VLA: Efficient Coarse-to-Fine Action Generation for Vision-Language-Action Policies

    cs.CV 2026-04 unverdicted novelty 7.0

    CF-VLA uses a coarse initialization over endpoint velocity followed by single-step refinement to achieve strong performance with low inference steps on CALVIN, LIBERO, and real-robot tasks.

  7. Mask World Model: Predicting What Matters for Robust Robot Policy Learning

    cs.RO 2026-04 unverdicted novelty 7.0

    Mask World Model predicts semantic mask dynamics with video diffusion and integrates it with a diffusion policy head, outperforming RGB world models on LIBERO and RLBench while showing better real-world generalization...

  8. CUBic: Coordinated Unified Bimanual Perception and Control Framework

    cs.RO 2026-05 unverdicted novelty 6.0

    CUBic learns a shared tokenized representation for bimanual robot perception and control via unidirectional aggregation, bidirectional codebook coordination, and a unified diffusion policy, yielding higher coordinatio...

  9. GuidedVLA: Specifying Task-Relevant Factors via Plug-and-Play Action Attention Specialization

    cs.RO 2026-05 unverdicted novelty 6.0

    GuidedVLA improves VLA success rates by manually supervising separate attention heads in the action decoder with auxiliary signals for task-relevant factors.

  10. ALAM: Algebraically Consistent Latent Action Model for Vision-Language-Action Models

    cs.RO 2026-05 unverdicted novelty 6.0

    ALAM introduces algebraic consistency regularization on latent action transitions from videos, raising VLA success rates from 47.9% to 85.0% on MetaWorld MT50 and 94.1% to 98.1% on LIBERO.

  11. One Token Per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy

    cs.CV 2026-05 unverdicted novelty 6.0

    Reducing visual input to one token per frame via adaptive attention pooling and a unified flow-matching objective improves long-horizon performance in VLA policies on MetaWorld, LIBERO, and real-robot tasks.

  12. One Token Per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy

    cs.CV 2026-05 unverdicted novelty 6.0

    Reducing visual input to one token per frame in world models for vision-language-action policies maintains long-horizon performance while improving success rates on MetaWorld, LIBERO, and real-robot tasks.

  13. Escaping the Diversity Trap in Robotic Manipulation via Anchor-Centric Adaptation

    cs.RO 2026-05 unverdicted novelty 6.0

    Anchor-Centric Adaptation escapes the diversity trap by prioritizing repeated demonstrations at core anchors over broad coverage, yielding higher success rates under fixed data budgets in robotic manipulation.

  14. ConsisVLA-4D: Advancing Spatiotemporal Consistency in Efficient 3D-Perception and 4D-Reasoning for Robotic Manipulation

    cs.RO 2026-05 unverdicted novelty 6.0

    ConsisVLA-4D adds cross-view semantic alignment, cross-object geometric fusion, and cross-scene dynamic reasoning to VLA models, delivering 21.6% and 41.5% gains plus 2.3x and 2.4x speedups on LIBERO and real-world tasks.

  15. From Pixels to Tokens: A Systematic Study of Latent Action Supervision for Vision-Language-Action Models

    cs.RO 2026-05 unverdicted novelty 6.0

    A unified comparison of latent action supervision strategies for VLA models reveals task-specific benefits, with image-based approaches aiding reasoning and generalization, action-based aiding motor control, and discr...

  16. PRTS: A Primitive Reasoning and Tasking System via Contrastive Representations

    cs.AI 2026-04 unverdicted novelty 6.0

    PRTS pretrains VLA models with contrastive goal-conditioned RL to embed goal-reachability probabilities from offline data, yielding SOTA results on robotic benchmarks especially for long-horizon and novel instructions.

  17. GazeVLA: Learning Human Intention for Robotic Manipulation

    cs.RO 2026-04 unverdicted novelty 6.0

    GazeVLA pretrains on large human egocentric datasets to capture gaze-based intention, then finetunes on limited robot data with chain-of-thought reasoning to achieve better robotic manipulation performance than baselines.

  18. CorridorVLA: Explicit Spatial Constraints for Generative Action Heads via Sparse Anchors

    cs.RO 2026-04 unverdicted novelty 6.0

    CorridorVLA improves VLA models by using predicted sparse anchors to impose explicit spatial corridors on action trajectories, yielding 3.4-12.4% success rate gains on LIBERO-Plus with GR00T-Corr reaching 83.21%.

  19. UniT: Toward a Unified Physical Language for Human-to-Humanoid Policy Learning and World Modeling

    cs.RO 2026-04 unverdicted novelty 6.0

    UniT creates a unified physical language via visual anchoring and tri-branch reconstruction to enable scalable human-to-humanoid transfer for policy learning and world modeling.

  20. OFlow: Injecting Object-Aware Temporal Flow Matching for Robust Robotic Manipulation

    cs.RO 2026-04 unverdicted novelty 6.0

    OFlow unifies temporal foresight and object-aware reasoning inside a shared latent space via flow matching to improve VLA robustness in robotic manipulation under distribution shifts.

  21. AnchorRefine: Synergy-Manipulation Based on Trajectory Anchor and Residual Refinement for Vision-Language-Action Models

    cs.RO 2026-04 unverdicted novelty 6.0

    AnchorRefine factorizes VLA action generation into a trajectory anchor for coarse planning and residual refinement for local corrections, improving success rates by up to 7.8% in simulation and 18% on real robots acro...

  22. Robotic Manipulation is Vision-to-Geometry Mapping ($f(v) \rightarrow G$): Vision-Geometry Backbones over Language and Video Models

    cs.RO 2026-04 unverdicted novelty 6.0

    Vision-geometry backbones using pretrained 3D world models outperform vision-language and video models for robotic manipulation by enabling direct mapping from visual input to geometric actions.

  23. Grounded World Model for Semantically Generalizable Planning

    cs.RO 2026-04 conditional novelty 6.0

    A vision-language-aligned world model turns visuomotor MPC into a language-following planner that reaches 87% success on 288 unseen semantic tasks where standard VLAs drop to 22%.

  24. ProGAL-VLA: Grounded Alignment through Prospective Reasoning in Vision-Language-Action Models

    cs.RO 2026-04 unverdicted novelty 6.0

    ProGAL-VLA uses 3D graphs, symbolic sub-goals, and a Grounding Alignment Contrastive loss to ground actions on verified embeddings, raising robustness from 30.3% to 71.5% and ambiguity AUROC to 0.81 on robotic benchmarks.

  25. SIM1: Physics-Aligned Simulator as Zero-Shot Data Scaler in Deformable Worlds

    cs.RO 2026-04 unverdicted novelty 6.0

    SIM1 converts sparse real demonstrations into high-fidelity synthetic data through physics-aligned simulation, yielding policies that match real-data performance at a 1:15 ratio with 90% zero-shot success on deformabl...

  26. EgoVerse: An Egocentric Human Dataset for Robot Learning from Around the World

    cs.RO 2026-04 unverdicted novelty 6.0

    EgoVerse releases 1,362 hours of standardized egocentric human data across 1,965 tasks and shows via multi-lab experiments that robot policy performance scales with human data volume when the data aligns with robot ob...

  27. World Action Models are Zero-shot Policies

    cs.RO 2026-02 unverdicted novelty 6.0

    DreamZero uses a 14B video diffusion model as a World Action Model to achieve over 2x better zero-shot generalization on real robots than state-of-the-art VLAs, real-time 7Hz closed-loop control, and cross-embodiment ...

  28. Cosmos Policy: Fine-Tuning Video Models for Visuomotor Control and Planning

    cs.AI 2026-01 conditional novelty 6.0

    Single-stage fine-tuning of a video model to generate actions as latent frames plus future states and values yields state-of-the-art robot policy performance on LIBERO, RoboCasa, and bimanual tasks.

  29. AttenA+: Rectifying Action Inequality in Robotic Foundation Models

    cs.RO 2026-05 unverdicted novelty 5.0

    AttenA+ applies velocity-driven action attention to reweight training objectives toward kinematically critical low-velocity segments, yielding small benchmark gains on Libero and RoboTwin without added parameters.

  30. VLA-GSE: Boosting Parameter-Efficient Fine-Tuning in VLA with Generalized and Specialized Experts

    cs.RO 2026-05 unverdicted novelty 5.0

    VLA-GSE uses spectral decomposition of the VLA backbone to create generalized and specialized experts, enabling effective robot task adaptation while updating only 2.51% of parameters and achieving 81.2% zero-shot suc...

  31. Can Explicit Physical Feasibility Benefit VLA Learning? An Empirical Study

    cs.LG 2026-04 unverdicted novelty 5.0

    Explicit geometry-based feasibility supervision added to diffusion VLA training leads to better physical reliability, task success, and faster learning with limited data in manipulation tasks.

  32. R3D: Revisiting 3D Policy Learning

    cs.CV 2026-04 unverdicted novelty 5.0

    A transformer 3D encoder plus diffusion decoder architecture, with 3D-specific augmentations, outperforms prior 3D policy methods on manipulation benchmarks by improving training stability.

  33. Motus: A Unified Latent Action World Model

    cs.CV 2025-12 unverdicted novelty 5.0

    Motus unifies understanding, video generation, and action in one latent world model via MoT experts and optical-flow latent actions, reporting gains over prior methods in simulation and real robots.

Reference graph

Works this paper leans on

105 extracted references · 105 canonical work pages · cited by 30 Pith papers · 4 internal anchors

  1. [1]

    AgiBot World Colosseo: A Large-scale Manipulation Platform for Scalable and Intelligent Embodied Systems

    AgiBot-World-Contributors, Qingwen Bu, Jisong Cai, Li Chen, Xiuqi Cui, Yan Ding, Siyuan Feng, Shenyuan Gao, Xindong He, Xuan Hu, Xu Huang, Shu Jiang, Yuxin Jiang, Cheng Jing, Hongyang Li, Jialu Li, Chiming Liu, Yi Liu, Yuxiang Lu, Jianlan Luo, Ping Luo, Yao Mu, Yuehan Niu, Yixuan Pan, Jiangmiao Pang, Yu Qiao, Guanghui Ren, Cheng Ruan, Jiaqi Shan, Yongjian...

  2. [2]

    Deep variational information bottleneck

    Alexander A Alemi, Ian Fischer, Joshua V Dillon, and Kevin Murphy. Deep variational information bottleneck. In ICLR, 2017.

  3. [3]

    Laser: Learning a latent action space for efficient reinforcement learning

    Arthur Allshire, Roberto Martín-Martín, Charles Lin, Shawn Manuel, Silvio Savarese, and Animesh Garg. Laser: Learning a latent action space for efficient reinforcement learning. In ICRA, 2021.

  4. [4]

    Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments

    Peter Anderson, Qi Wu, Damien Teney, Jake Bruce, Mark Johnson, Niko Sünderhauf, Ian Reid, Stephen Gould, and Anton Van Den Hengel. Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments. In CVPR, 2018.

  5. [5]

    Self-supervised learning from images with a joint-embedding predictive architecture

    Mahmoud Assran, Quentin Duval, Ishan Misra, Piotr Bojanowski, Pascal Vincent, Michael Rabbat, Yann LeCun, and Nicolas Ballas. Self-supervised learning from images with a joint-embedding predictive architecture. In CVPR, 2023.

  6. [6]

    Revisiting Feature Prediction for Learning Visual Representations from Video

    Adrien Bardes, Quentin Garrido, Jean Ponce, Xinlei Chen, Michael Rabbat, Yann LeCun, Mahmoud Assran, and Nicolas Ballas. Revisiting feature prediction for learning visual representations from video. arXiv preprint arXiv:2404.08471, 2024.

  7. [7]

    HYDRA: Hybrid robot actions for imitation learning

    Suneel Belkhale, Yuchen Cui, and Dorsa Sadigh. HYDRA: Hybrid robot actions for imitation learning. arXiv preprint arXiv:2306.17237, 2023.

  8. [8]

    Towards generalizable zero-shot manipulation via translating human interaction plans

    Homanga Bharadhwaj, Abhinav Gupta, Vikash Kumar, and Shubham Tulsiani. Towards generalizable zero-shot manipulation via translating human interaction plans. In ICRA, 2024.

  9. [9]

    Zero-shot robotic manipulation with pre-trained image-editing diffusion models

    Kevin Black, Mitsuhiko Nakamoto, Pranav Atreya, Homer Rich Walke, Chelsea Finn, Aviral Kumar, and Sergey Levine. Zero-shot robotic manipulation with pre-trained image-editing diffusion models. In ICLR, 2024.

  10. [10]

    RT-2: Vision-language-action models transfer web knowledge to robotic control

    Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof Choromanski, Tianli Ding, Danny Driess, Avinava Dubey, Chelsea Finn, et al. RT-2: Vision-language-action models transfer web knowledge to robotic control. In CoRL, 2023.

  11. [11]

    RT-1: Robotics transformer for real-world control at scale

    Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, et al. RT-1: Robotics transformer for real-world control at scale. In RSS, 2023.

  12. [12]

    Genie: Generative interactive environments

    Jake Bruce, Michael D Dennis, Ashley Edwards, Jack Parker-Holder, Yuge Shi, Edward Hughes, Matthew Lai, Aditi Mavalankar, Richie Steigerwald, Chris Apps, et al. Genie: Generative interactive environments. In ICML, 2024.

  13. [13]

    Towards synergistic, generalized, and efficient dual-system for robotic manipulation

    Qingwen Bu, Hongyang Li, Li Chen, Jisong Cai, Jia Zeng, Heming Cui, Maoqing Yao, and Yu Qiao. Towards synergistic, generalized, and efficient dual-system for robotic manipulation. arXiv preprint arXiv:2410.08001, 2024.

  14. [14]

    Closed-loop visuomotor control with generative expectation for robotic manipulation

    Qingwen Bu, Jia Zeng, Li Chen, Yanchao Yang, Guyue Zhou, Junchi Yan, Ping Luo, Heming Cui, Yi Ma, and Hongyang Li. Closed-loop visuomotor control with generative expectation for robotic manipulation. In NeurIPS, 2024.

  15. [15]

    Berkeley UR5 demonstration dataset

    Lawrence Yunliang Chen, Simeon Adebola, and Ken Goldberg. Berkeley UR5 demonstration dataset. https://sites.google.com/view/berkeley-ur5/home.

  16. [16]

    IGOR: Image-goal representations are the atomic control units for foundation models in embodied AI

    Xiaoyu Chen, Junliang Guo, Tianyu He, Chuheng Zhang, Pushi Zhang, Derek Cathera Yang, Li Zhao, and Jiang Bian. IGOR: Image-goal representations are the atomic control units for foundation models in embodied AI. arXiv preprint arXiv:2411.00785, 2024.

  17. [17]

    Diffusion Policy: Visuomotor policy learning via action diffusion

    Cheng Chi, Siyuan Feng, Yilun Du, Zhenjia Xu, Eric Cousineau, Benjamin Burchfiel, and Shuran Song. Diffusion Policy: Visuomotor policy learning via action diffusion. In RSS, 2023.

  18. [18]

    Unsupervised cross-lingual representation learning at scale

    Alexis Conneau et al. Unsupervised cross-lingual representation learning at scale. arXiv preprint arXiv:1911.02116, 2019.

  19. [19]

    From play to policy: Conditional behavior generation from uncurated robot data

    Zichen Jeff Cui, Yibin Wang, Nur Muhammad Mahi Shafiullah, and Lerrel Pinto. From play to policy: Conditional behavior generation from uncurated robot data. arXiv preprint arXiv:2210.10047, 2022.

  20. [20]

    DynaMo: In-domain dynamics pretraining for visuo-motor control

    Zichen Jeff Cui, Hengkai Pan, Aadhithya Iyer, Siddhant Haldar, and Lerrel Pinto. DynaMo: In-domain dynamics pretraining for visuo-motor control. In NeurIPS, 2024.

  21. [21]

    Shivin Dass, Jullian Yapeter, Jesse Zhang, Jiahui Zhang, Karl Pertsch, Stefanos Nikolaidis, and Joseph J. Lim. CLVR jaco play dataset, 2023. URL https://github.com/clvrai/clvr_jaco_play_dataset.

  22. [22]

    BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

    Jacob Devlin. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.

  23. [23]

    Scaling Cross-Embodied Learning: One policy for manipulation, navigation, locomotion and aviation

    Ria Doshi, Homer Walke, Oier Mees, Sudeep Dasari, and Sergey Levine. Scaling Cross-Embodied Learning: One policy for manipulation, navigation, locomotion and aviation. In CoRL, 2024.

  24. [24]

    Learning universal policies via text-guided video generation

    Yilun Du, Sherry Yang, Bo Dai, Hanjun Dai, Ofir Nachum, Josh Tenenbaum, Dale Schuurmans, and Pieter Abbeel. Learning universal policies via text-guided video generation. In NeurIPS, 2024.

  25. [25]

    Video language planning

    Yilun Du, Sherry Yang, Pete Florence, Fei Xia, Ayzaan Wahid, brian ichter, Pierre Sermanet, Tianhe Yu, Pieter Abbeel, Joshua B. Tenenbaum, Leslie Pack Kaelbling, Andy Zeng, and Jonathan Tompson. Video language planning. In ICLR, 2024.

  26. [26]

    FLIP: Flow-centric generative planning for general-purpose manipulation tasks

    Chongkai Gao, Haozhuo Zhang, Zhixuan Xu, Zhehao Cai, and Lin Shao. FLIP: Flow-centric generative planning for general-purpose manipulation tasks. In ICLR, 2025.

  27. [27]

    AdaWorld: Learning adaptable world models with latent actions

    Shenyuan Gao, Siyuan Zhou, Yilun Du, Jun Zhang, and Chuang Gan. AdaWorld: Learning adaptable world models with latent actions. arXiv preprint arXiv:2503.18938, 2025.

  28. [28]

    Octo: An open-source generalist robot policy

    Dibya Ghosh, Homer Walke, Karl Pertsch, Kevin Black, Oier Mees, Sudeep Dasari, Joey Hejna, Tobias Kreiman, Charles Xu, Jianlan Luo, et al. Octo: An open-source generalist robot policy. In RSS, 2024.

  29. [29]

    Ego4D: Around the world in 3,000 hours of egocentric video

    Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, et al. Ego4D: Around the world in 3,000 hours of egocentric video. In CVPR, 2022.

  30. [30]

    Learning latent dynamics for planning from pixels

    Danijar Hafner, Timothy Lillicrap, Ian Fischer, Ruben Villegas, David Ha, Honglak Lee, and James Davidson. Learning latent dynamics for planning from pixels. In ICML, 2019.

  31. [31]

    Minho Heo, Youngwoon Lee, Doohyun Lee, and Joseph J. Lim. FurnitureBench: Reproducible real-world benchmark for long-horizon complex manipulation. In RSS, 2023.

  32. [32]

    SPOT: SE(3) pose trajectory diffusion for object-centric manipulation

    Cheng-Chun Hsu, Bowen Wen, Jie Xu, Yashraj Narang, Xiaolong Wang, Yuke Zhu, Joydeep Biswas, and Stan Birchfield. SPOT: SE(3) pose trajectory diffusion for object-centric manipulation. arXiv preprint arXiv:2411.00965, 2024.

  33. [33]

    LoRA: Low-rank adaptation of large language models

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In ICLR, 2022.

  34. [34]

    BC-Z: Zero-shot task generalization with robotic imitation learning

    Eric Jang, Alex Irpan, Mohi Khansari, Daniel Kappler, Frederik Ebert, Corey Lynch, Sergey Levine, and Chelsea Finn. BC-Z: Zero-shot task generalization with robotic imitation learning. In CoRL, 2022.

  35. [35]

    MaIL: Improving imitation learning with selective state space models

    Xiaogang Jia, Qian Wang, Atalay Donat, Bowen Xing, Ge Li, Hongyi Zhou, Onur Celik, Denis Blessing, Rudolf Lioutikov, and Gerhard Neumann. MaIL: Improving imitation learning with selective state space models. In CoRL, 2024.

  36. [36]

    Scalable deep reinforcement learning for vision-based robotic manipulation

    Dmitry Kalashnikov, Alex Irpan, Peter Pastor, Julian Ibarz, Alexander Herzog, Eric Jang, Deirdre Quillen, Ethan Holly, Mrinal Kalakrishnan, Vincent Vanhoucke, et al. Scalable deep reinforcement learning for vision-based robotic manipulation. In CoRL, 2018.

  37. [37]

    Prismatic VLMs: Investigating the design space of visually-conditioned language models

    Siddharth Karamcheti, Suraj Nair, Ashwin Balakrishna, Percy Liang, Thomas Kollar, and Dorsa Sadigh. Prismatic VLMs: Investigating the design space of visually-conditioned language models. In ICML, 2024.

  38. [38]

    DROID: A large-scale in-the-wild robot manipulation dataset

    Alexander Khazatsky, Karl Pertsch, Suraj Nair, Ashwin Balakrishna, Sudeep Dasari, Siddharth Karamcheti, Soroush Nasiriany, Mohan Kumar Srirama, Lawrence Yunliang Chen, Kirsty Ellis, et al. DROID: A large-scale in-the-wild robot manipulation dataset. In RSS, 2024.

  39. [39]

    OpenVLA: An open-source vision-language-action model

    Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. OpenVLA: An open-source vision-language-action model. In CoRL, 2024.

  40. [40]

    Beyond the nav-graph: Vision-and-language navigation in continuous environments

    Jacob Krantz, Erik Wijmans, Arjun Majumdar, Dhruv Batra, and Stefan Lee. Beyond the nav-graph: Vision-and-language navigation in continuous environments. In ECCV, 2020.

  41. [41]

    Beyond the nav-graph: Vision and language navigation in continuous environments

    Jacob Krantz, Erik Wijmans, Arjun Majumdar, Dhruv Batra, and Stefan Lee. Beyond the nav-graph: Vision and language navigation in continuous environments. In ECCV, 2020.

  42. [42]

    Partially observable Markov decision processes (POMDPs) and robotics

    Hanna Kurniawati. Partially observable Markov decision processes (POMDPs) and robotics. arXiv preprint arXiv:2107.07599, 2021.

  43. [43]

    Set transformer: A framework for attention-based permutation-invariant neural networks

    Juho Lee, Yoonho Lee, Jungtaek Kim, Adam Kosiorek, Seungjin Choi, and Yee Whye Teh. Set transformer: A framework for attention-based permutation-invariant neural networks. In ICML, 2019.

  44. [44]

    Behavior generation with latent actions

    Seungjae Lee, Yibin Wang, Haritheja Etukuru, H Jin Kim, Nur Muhammad Mahi Shafiullah, and Lerrel Pinto. Behavior generation with latent actions. In ICML, 2024.

  45. [45]

    Towards generalist robot policies: What matters in building vision-language-action models

    Xinghang Li, Peiyan Li, Minghuan Liu, Dong Wang, Jirong Liu, Bingyi Kang, Xiao Ma, Tao Kong, Hanbo Zhang, and Huaping Liu. Towards generalist robot policies: What matters in building vision-language-action models. arXiv preprint arXiv:2412.14058, 2024.

  46. [46]

    Vision-language foundation models as effective robot imitators

    Xinghang Li, Minghuan Liu, Hanbo Zhang, Cunjun Yu, Jie Xu, Hongtao Wu, Chilam Cheang, Ya Jing, Weinan Zhang, Huaping Liu, Hang Li, and Tao Kong. Vision-language foundation models as effective robot imitators. In ICLR, 2024.

  47. [47]

    Evaluating real-world robot manipulation policies in simulation

    Xuanlin Li, Kyle Hsu, Jiayuan Gu, Karl Pertsch, Oier Mees, Homer Rich Walke, Chuyuan Fu, Ishikaa Lunawat, Isabel Sieh, Sean Kirmani, et al. Evaluating real-world robot manipulation policies in simulation. In CoRL, 2024.

  48. [48]

    LIBERO: Benchmarking knowledge transfer for lifelong robot learning

    Bo Liu, Yifeng Zhu, Chongkai Gao, Yihao Feng, Qiang Liu, Yuke Zhu, and Peter Stone. LIBERO: Benchmarking knowledge transfer for lifelong robot learning. In NeurIPS, 2024.

  49. [49]

    Improved baselines with visual instruction tuning

    Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In CVPR, 2023.

  50. [50]

    Robot learning on the job: Human-in-the-loop autonomy and learning during deployment

    Huihan Liu, Soroush Nasiriany, Lance Zhang, Zhiyao Bao, and Yuke Zhu. Robot learning on the job: Human-in-the-loop autonomy and learning during deployment. In RSS, 2023.

  51. [51]

    Multi-stage cable routing through hierarchical imitation learning

    Jianlan Luo, Charles Xu, Xinyang Geng, Gilbert Feng, Kuan Fang, Liam Tan, Stefan Schaal, and Sergey Levine. Multi-stage cable routing through hierarchical imitation learning. TRO, 2023.

  52. [52]

    FMB: A functional manipulation benchmark for generalizable robotic learning

    Jianlan Luo, Charles Xu, Fangchen Liu, Liam Tan, Zipeng Lin, Jeffrey Wu, Pieter Abbeel, and Sergey Levine. FMB: A functional manipulation benchmark for generalizable robotic learning. IJRR, 2023.

  53. [53]

    Interactive language: Talking to robots in real time

    Corey Lynch, Ayzaan Wahid, Jonathan Tompson, Tianli Ding, James Betker, Robert Baruch, Travis Armstrong, and Pete Florence. Interactive language: Talking to robots in real time. RA-L, 2023.

  54. [54]

    Vision language models are in-context value learners

    Yecheng Jason Ma, Joey Hejna, Ayzaan Wahid, Chuyuan Fu, Dhruv Shah, Jacky Liang, Zhuo Xu, Sean Kirmani, Peng Xu, Danny Driess, et al. Vision language models are in-context value learners. In ICLR, 2025.

  55. [55]

    RoboTurk: A crowdsourcing platform for robotic skill learning through imitation

    Ajay Mandlekar, Yuke Zhu, Animesh Garg, Jonathan Booher, Max Spero, Albert Tung, Julian Gao, John Emmons, Anchit Gupta, Emre Orbay, Silvio Savarese, and Li Fei-Fei. RoboTurk: A crowdsourcing platform for robotic skill learning through imitation. In CoRL, 2018.

  56. [56]

    CALVIN: A benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks

    Oier Mees, Lukas Hermann, Erick Rosete-Beas, and Wolfram Burgard. CALVIN: A benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. RA-L, 2022.

  57. [57]

    Grounding language with visual affordances over unstructured data

    Oier Mees, Jessica Borja-Diaz, and Wolfram Burgard. Grounding language with visual affordances over unstructured data. In ICRA, 2023.

  58. [58]

    Structured world models from human videos

    Russell Mendonca, Shikhar Bahl, and Deepak Pathak. Structured world models from human videos. In CoRL, 2023.

  59. [59]

    Quest: Self-supervised skill abstractions for learning continuous control

    Atharva Mete, Haotian Xue, Albert Wilcox, Yongxin Chen, and Animesh Garg. Quest: Self-supervised skill abstractions for learning continuous control. In NeurIPS, 2024.

  60. [60]

    Learning finite-state controllers for partially observable environments

    Nicolas Meuleau, Leonid Peshkin, Kee-Eung Kim, and Leslie Pack Kaelbling. Learning finite-state controllers for partially observable environments. arXiv preprint arXiv:1301.6721, 2013.

  61. [61]

    Learning and retrieval from prior data for skill-based imitation learning

    Soroush Nasiriany, Tian Gao, Ajay Mandlekar, and Yuke Zhu. Learning and retrieval from prior data for skill-based imitation learning. In CoRL, 2022.

  62. [62]

    DINOv2: Learning robust visual features without supervision

    Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. DINOv2: Learning robust visual features without supervision. TMLR, 2024.

  63. [63]

    Open X-Embodiment: Robotic learning datasets and RT-X models

    Abhishek Padalkar, Acorn Pooley, Ajinkya Jain, Alex Bewley, Alex Herzog, Alex Irpan, Alexander Khazatsky, Anant Rai, Anikait Singh, Anthony Brohan, et al. Open X-Embodiment: Robotic learning datasets and RT-X models. In ICRA, 2024.

  64. [64]

    Variational autoencoder for deep learning of images, labels and captions

    Yunchen Pu, Zhe Gan, Ricardo Henao, Xin Yuan, Chunyuan Li, Andrew Stevens, and Lawrence Carin. Variational autoencoder for deep learning of images, labels and captions. In NeurIPS, 2016.

  65. [65]

    Shared Control Templates for Assistive Robotics

    Gabriel Quere, Annette Hagengruber, Maged Iskandar, Samuel Bustamante, Daniel Leidner, Freek Stulp, and Joern Vogel. Shared Control Templates for Assistive Robotics. In ICRA, Paris, France, 2020.

  66. [66]

    Improving language understanding by generative pre-training

    Alec Radford. Improving language understanding by generative pre-training. 2018.

  67. [67]

    Exploring the limits of transfer learning with a unified text-to-text transformer

    Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. JMLR, 2020.

  68. [68]

    Multimodal Diffusion Transformer: Learning versatile behavior from multimodal goals

    Moritz Reuss, Ömer Erdinç Yağmurlu, Fabian Wenzel, and Rudolf Lioutikov. Multimodal Diffusion Transformer: Learning versatile behavior from multimodal goals. In ICRA Workshops, 2024.

  69. [69]

    Multi-resolution sensing for real-time control with Vision-Language Models

    Saumya Saxena, Mohit Sharma, and Oliver Kroemer. Multi-resolution sensing for real-time control with Vision-Language Models. In CoRL, 2023.

  70. [70]

    Learning to act without actions

    Dominik Schmidt and Minqi Jiang. Learning to act without actions. In ICLR, 2024.

  71. [71]

    On bringing robots home

    Nur Muhammad Mahi Shafiullah, Anant Rai, Haritheja Etukuru, Yiqian Liu, Ishan Misra, Soumith Chintala, and Lerrel Pinto. On bringing robots home. arXiv preprint arXiv:2311.16098, 2023.

  72. [72]

    GNM: A general navigation model to drive any robot

    Dhruv Shah, Ajay Sridhar, Arjun Bhorkar, Noriaki Hirose, and Sergey Levine. GNM: A general navigation model to drive any robot. In ICRA, 2023.

  73. [73]

    MUTEX: Learning unified policies from multimodal task specifications

    Rutav Shah, Roberto Martín-Martín, and Yuke Zhu. MUTEX: Learning unified policies from multimodal task specifications. In CoRL, 2023.

  74. [74]

    Grounding multimodal large language models in actions

    Andrew Szot, Bogdan Mazoure, Harsh Agrawal, Devon Hjelm, Zsolt Kira, and Alexander Toshev. Grounding multimodal large language models in actions. In NeurIPS, 2024.

  75. [75]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.

  76. [76]

    Neural discrete representation learning

    Aaron Van Den Oord, Oriol Vinyals, and Koray Kavukcuoglu. Neural discrete representation learning. In NeurIPS, 2017.

  77. [77]

    Phenaki: Variable length video generation from open domain textual descriptions

    Ruben Villegas, Mohammad Babaeizadeh, Pieter-Jan Kindermans, Hernan Moraldo, Han Zhang, Mohammad Taghi Saffar, Santiago Castro, Julius Kunze, and Dumitru Erhan. Phenaki: Variable length video generation from open domain textual descriptions. In ICLR, 2023.

  78. [78]

    BridgeData v2: A dataset for robot learning at scale

    Homer Rich Walke, Kevin Black, Tony Z Zhao, Quan Vuong, Chongyi Zheng, Philippe Hansen-Estruch, Andre Wang He, Vivek Myers, Moo Jin Kim, Max Du, et al. BridgeData v2: A dataset for robot learning at scale. In CoRL, 2023.

  79. [79]

    OmniJARVIS: Unified vision-language-action tokenization enables open-world instruction following agents

    Zihao Wang, Shaofei Cai, Zhancun Mu, Haowei Lin, Ceyao Zhang, Xuejie Liu, Qing Li, Anji Liu, Xiaojian Ma, and Yitao Liang. OmniJARVIS: Unified vision-language-action tokenization enables open-world instruction following agents. In NeurIPS, 2024.

  80. [80]

    Chain-of-thought prompting elicits reasoning in large language models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. In NeurIPS, 2022.

Showing first 80 references.