pith · machine review for the scientific record

arxiv: 2505.06111 · v3 · submitted 2025-05-09 · 💻 cs.RO · cs.AI · cs.LG

Recognition: 3 theorem links · Lean Theorem

UniVLA: Learning to Act Anywhere with Task-centric Latent Actions

Guanghui Ren, Hongyang Li, Jisong Cai, Maoqing Yao, Ping Luo, Qingwen Bu, Shenyuan Gao, Yanting Yang

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 15:22 UTC · model grok-4.3

classification 💻 cs.RO · cs.AI · cs.LG
keywords vision-language-action · latent actions · cross-embodiment · robot learning · generalist policy · DINO features · video pretraining · manipulation

The pith

UniVLA derives task-centric latent actions from unlabeled videos to build cross-embodiment robot policies that outperform prior methods with far less data and compute.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Existing vision-language-action models depend on large volumes of action-annotated data tied to one robot body, which restricts scaling and generalization. UniVLA instead trains a latent action model on broad internet videos to isolate only the actions relevant to a given language instruction. It performs this extraction inside DINO visual features so that task-irrelevant motion is suppressed. The resulting representations let a single policy be decoded for different physical robots, yielding stronger benchmark results while using less than one-twentieth the pretraining compute and one-tenth the downstream data of OpenVLA. Performance continues to rise as more heterogeneous videos, including human demonstrations, are added to training.

Core claim

A language-conditioned latent action model trained inside DINO feature space on internet-scale videos produces task-centric action representations that transfer across embodiments. These representations are decoded at deployment time to drive specific robots, delivering state-of-the-art results on manipulation and navigation benchmarks as well as real-robot tests. The same pipeline improves steadily when additional heterogeneous data sources are included.

What carries the argument

Language-conditioned latent action model inside DINO feature space that extracts task-centric representations from videos while suppressing irrelevant dynamics.
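
To make the mechanism concrete, here is a minimal sketch of one way such a model could be built, assuming a VQ-VAE-style discrete bottleneck, an inverse-dynamics encoder, and a forward decoder over frozen DINO features. The paper does not publish this exact architecture; every module name, dimension, and loss weight below is an illustrative guess, not the authors' method.

```python
# Hedged sketch of a latent action model (LAM) over frozen DINO features.
# Assumptions (not confirmed by the paper): vector-quantized latent actions,
# an inverse-dynamics encoder, and a forward decoder that reconstructs the
# next frame's DINO features conditioned on the language instruction.
import torch
import torch.nn as nn

class LatentActionModel(nn.Module):
    def __init__(self, feat_dim=768, lang_dim=512, codebook_size=16, code_dim=128):
        super().__init__()
        # Inverse dynamics: infer a latent action from consecutive DINO
        # features plus the instruction (the conditioning is what is meant
        # to suppress task-irrelevant motion such as camera shake).
        self.encoder = nn.Sequential(
            nn.Linear(2 * feat_dim + lang_dim, 512), nn.GELU(),
            nn.Linear(512, code_dim),
        )
        # Small discrete codebook of latent actions (VQ-VAE style, assumed).
        self.codebook = nn.Embedding(codebook_size, code_dim)
        # Forward model: predict next-step DINO features from the current
        # features, the instruction, and the quantized latent action.
        self.decoder = nn.Sequential(
            nn.Linear(feat_dim + lang_dim + code_dim, 512), nn.GELU(),
            nn.Linear(512, feat_dim),
        )

    def forward(self, f_t, f_next, lang):
        z_e = self.encoder(torch.cat([f_t, f_next, lang], dim=-1))
        # Nearest-neighbor quantization with a straight-through estimator.
        codes = torch.cdist(z_e, self.codebook.weight).argmin(dim=-1)
        z_q = self.codebook(codes)
        z_st = z_e + (z_q - z_e).detach()
        f_pred = self.decoder(torch.cat([f_t, lang, z_st], dim=-1))
        recon = (f_pred - f_next).pow(2).mean()      # feature reconstruction
        commit = (z_e - z_q.detach()).pow(2).mean()  # commitment term
        embed = (z_q - z_e.detach()).pow(2).mean()   # codebook update term
        return recon + 0.25 * commit + embed
```

A downstream generalist policy would then be trained to predict these discrete latents from observations and instructions, with a light decoder head mapping them into each robot's native action space at deployment.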

Load-bearing premise

The latent action model trained on internet videos will consistently separate task-relevant dynamics from embodiment-specific or irrelevant motion and those separations will transfer to new robots.

What would settle it

Measure whether a policy trained only on the video corpus matches or exceeds OpenVLA performance when deployed on a previously unseen robot embodiment without any additional action labels for that body.
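
Rendered as a hypothetical harness (none of these interfaces come from the paper; `rollout`, both policy objects, and the environment factory are placeholders):

```python
# Hypothetical settling experiment: deploy a policy pretrained only on
# unlabeled video to an unseen embodiment with no action labels for that
# body, then compare success rates against an OpenVLA-style baseline.
# All interfaces (policy.act, env.reset/step) are assumed, not real APIs.
import random

def rollout(policy, env, max_steps=200):
    obs = env.reset()
    for _ in range(max_steps):
        obs, done, success = env.step(policy.act(obs))
        if done:
            return success
    return False

def success_rate(policy, env_factory, n_trials=50, seed=0):
    rng = random.Random(seed)
    wins = sum(
        rollout(policy, env_factory(seed=rng.randrange(10**6)))
        for _ in range(n_trials)
    )
    return wins / n_trials

def premise_settled(video_only_policy, baseline_policy, env_factory):
    sr_video = success_rate(video_only_policy, env_factory)
    sr_base = success_rate(baseline_policy, env_factory)
    # The load-bearing premise survives only if the video-only policy
    # matches or beats the action-supervised baseline on this body.
    return sr_video >= sr_base, (sr_video, sr_base)
```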

Original abstract

A generalist robot should perform effectively across various environments. However, most existing approaches heavily rely on scaling action-annotated data to enhance their capabilities. Consequently, they are often limited to single physical specification and struggle to learn transferable knowledge across different embodiments and environments. To confront these limitations, we propose UniVLA, a new framework for learning cross-embodiment vision-language-action (VLA) policies. Our key innovation is to derive task-centric action representations from videos with a latent action model. This enables us to exploit extensive data across a wide spectrum of embodiments and perspectives. To mitigate the effect of task-irrelevant dynamics, we incorporate language instructions and establish a latent action model within the DINO feature space. Learned from internet-scale videos, the generalist policy can be deployed to various robots through efficient latent action decoding. We obtain state-of-the-art results across multiple manipulation and navigation benchmarks, as well as real-robot deployments. UniVLA achieves superior performance over OpenVLA with less than 1/20 of pretraining compute and 1/10 of downstream data. Continuous performance improvements are observed as heterogeneous data, even including human videos, are incorporated into the training pipeline. The results underscore UniVLA's potential to facilitate scalable and efficient robot policy learning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes UniVLA, a cross-embodiment VLA policy framework that derives task-centric latent actions from internet-scale videos via a language-conditioned latent action model operating in DINO feature space. This representation is intended to suppress task-irrelevant dynamics, enabling pretraining on heterogeneous data (including human videos) with far less compute and downstream data than prior methods like OpenVLA, while achieving SOTA results on manipulation/navigation benchmarks and real-robot deployments, with performance scaling as more diverse video data is added.

Significance. If the central claims hold, the work offers a promising path toward scalable robot learning by leveraging abundant unlabeled video data rather than action-annotated trajectories, potentially reducing embodiment-specific data requirements and enabling broader transfer. The reported efficiency gains and inclusion of human videos are notable strengths that could influence future VLA designs if the task-centric mechanism is validated.

major comments (3)
  1. [§3.2] (Latent Action Model): The model is trained with language conditioning inside DINO features to mitigate task-irrelevant dynamics, but no equation or loss term (e.g., no explicit action reconstruction, contrastive, or invariance objective) is provided that enforces disentanglement of actionable elements from camera motion or background changes. Without this, the claim that the representations are task-centric and transfer across embodiments rests on an unverified assumption, and gains may simply reflect increased data volume. (An illustrative candidate objective is sketched after the minor comments below.)
  2. [§4] (Experiments) and associated tables: The superior performance over OpenVLA with <1/20 pretraining compute and 1/10 downstream data is reported, yet there is no ablation controlling for total data volume and no comparison against a non-latent-action baseline trained on the identical heterogeneous corpus (including human videos). This leaves open whether the efficiency and scaling claims are due to the proposed mechanism or simply more data.
  3. [§4.3] (Real-robot deployments): Continuous improvements are claimed as human videos are added, but the results lack error bars, statistical significance tests, or per-embodiment breakdowns that would confirm transfer rather than embodiment-specific overfitting.
minor comments (2)
  1. [Abstract / §2] The abstract and introduction use 'task-centric latent actions' without an early formal definition or diagram; a concise equation or figure in §2 would improve readability.
  2. [Figures / Tables] Figure captions and table footnotes should explicitly state the number of seeds/runs and whether DINO features are frozen during policy training.
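
For concreteness on major comment 1, here is one candidate form of the missing objective, written as this review's reconstruction rather than an equation from the paper: a language-conditioned forward-prediction loss in frozen DINO feature space, with the latent action forced through a small discrete bottleneck so that only instruction-relevant dynamics can pass.

```latex
% Illustrative only: \phi is a frozen DINO encoder, \ell the instruction,
% E an inverse-dynamics encoder, q(\cdot) quantization onto a small
% codebook \mathcal{C}, and D a forward decoder.
\begin{aligned}
  z_t &= q\!\left( E\big(\phi(o_t),\, \phi(o_{t+1}),\, \ell\big) \right),
      \qquad z_t \in \mathcal{C}, \\
  \mathcal{L}_{\mathrm{LAM}} &=
      \big\lVert D\big(\phi(o_t),\, \ell,\, z_t\big) - \phi(o_{t+1}) \big\rVert_2^2 .
\end{aligned}
```

Whether a bottleneck of this kind suffices to discard camera motion and background change, absent an explicit invariance term, is exactly what the comment presses on.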

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback and positive assessment of the work's potential. We address each major comment below with proposed revisions to improve clarity, experimental rigor, and statistical reporting.

Point-by-point responses
  1. Referee: [§3.2] (Latent Action Model): The model is trained with language conditioning inside DINO features to mitigate task-irrelevant dynamics, but no equation or loss term (e.g., no explicit action reconstruction, contrastive, or invariance objective) is provided that enforces disentanglement of actionable elements from camera motion or background changes. Without this, the claim that the representations are task-centric and transfer across embodiments rests on an unverified assumption, and gains may simply reflect increased data volume.

    Authors: We appreciate the referee highlighting the need for explicit formulation. The latent action model in §3.2 uses language conditioning within DINO features to suppress task-irrelevant elements such as camera motion. To make the mechanism transparent, we will revise §3.2 to include the full mathematical description of the model architecture and the training objectives, which incorporate language-conditioned reconstruction to prioritize actionable components. This addition will directly support the task-centric property and address the concern about unverified assumptions. revision: yes

  2. Referee: [§4] (Experiments) and associated tables: The superior performance over OpenVLA with <1/20 pretraining compute and 1/10 downstream data is reported, yet there is no ablation controlling for total data volume and no comparison against a non-latent-action baseline trained on the identical heterogeneous corpus (including human videos). This leaves open whether the efficiency and scaling claims are due to the proposed mechanism or simply more data.

    Authors: This comment correctly identifies a gap in the experimental controls. Our primary baselines compare against OpenVLA on its published data, and we show performance scaling with added heterogeneous videos. However, training a full non-latent-action VLA baseline on the exact same large-scale heterogeneous corpus (including human videos) would require prohibitive additional compute. In the revision, we will expand §4 with a dedicated discussion of this limitation, clarify why the latent action approach enables heterogeneous data use, and include partial ablations on data volume subsets to better isolate the contribution of the proposed mechanism. revision: partial

  3. Referee: [§4.3] (Real-robot deployments): Continuous improvements are claimed as human videos are added, but the results lack error bars, statistical significance tests, or per-embodiment breakdowns that would confirm transfer rather than embodiment-specific overfitting.

    Authors: We agree that enhanced statistical reporting is necessary for the real-robot results. The deployments in §4.3 were performed across multiple trials on different robot platforms. We will revise this section to report error bars (standard deviation across trials), specify trial counts, provide per-embodiment success rate breakdowns, and include basic statistical significance tests to better demonstrate cross-embodiment transfer. revision: yes
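
A minimal sketch of the statistical reporting promised in response 3, assuming per-trial binary success logs for each embodiment; the robot names and trial outcomes below are placeholders, not results from the paper.

```python
# Hedged sketch: per-embodiment success rates with percentile-bootstrap
# confidence intervals over binary trial outcomes. The trial logs here
# are illustrative placeholders, not data from the paper.
import random

def bootstrap_ci(outcomes, n_boot=10_000, alpha=0.05, seed=0):
    """Point estimate and (1 - alpha) percentile bootstrap CI for a success rate."""
    rng = random.Random(seed)
    n = len(outcomes)
    means = sorted(
        sum(rng.choice(outcomes) for _ in range(n)) / n for _ in range(n_boot)
    )
    lo = means[int(n_boot * alpha / 2)]
    hi = means[int(n_boot * (1 - alpha / 2)) - 1]
    return sum(outcomes) / n, (lo, hi)

# Hypothetical per-embodiment breakdown (placeholder 0/1 trial logs).
trials = {
    "arm_platform_a": [1, 1, 0, 1, 1, 1, 0, 1, 1, 1],
    "arm_platform_b": [1, 0, 1, 1, 0, 1, 1, 0, 1, 1],
}
for robot, outcomes in trials.items():
    rate, (lo, hi) = bootstrap_ci(outcomes)
    print(f"{robot}: success {rate:.0%} (95% CI {lo:.0%}-{hi:.0%}, n={len(outcomes)})")
```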

Circularity Check

0 steps flagged

No significant circularity; derivation relies on external pretraining and heterogeneous data

Full rationale

The paper's central claims rest on training a latent action model in pretrained DINO feature space using internet-scale videos (including human data) conditioned on language, then decoding for robot policies. This chain uses independent external components (DINO features, large video corpora) rather than fitting parameters to the target robot data and renaming them as predictions. No equations reduce performance metrics to self-defined quantities, and no load-bearing self-citations or uniqueness theorems are invoked to force the result. Empirical gains from adding heterogeneous data are presented as observations, not tautological outputs of the model definition itself.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the unproven effectiveness of the latent action model at isolating task-centric information from heterogeneous video sources.

axioms (1)
  • domain assumption: Language instructions can be used to suppress task-irrelevant dynamics when learning latent actions from video.
    Explicitly invoked in the abstract to justify the DINO-space latent action model.
invented entities (1)
  • task-centric latent actions (no independent evidence)
    purpose: Compact representation of actions extracted from video that transfers across robot embodiments
    Core new construct introduced to enable training on unlabeled video instead of action-annotated data

pith-pipeline@v0.9.0 · 5550 in / 1354 out tokens · 56423 ms · 2026-05-12T15:22:32.646114+00:00 · methodology

discussion (0)


Forward citations

Cited by 33 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. RotVLA: Rotational Latent Action for Vision-Language-Action Model

    cs.RO 2026-05 unverdicted novelty 7.0

    RotVLA models latent actions as continuous SO(n) rotations with triplet-frame supervision and flow-matching to reach 98.2% success on LIBERO and 89.6%/88.5% on RoboTwin2.0 using a 1.7B-parameter model.

  2. ALAM: Algebraically Consistent Latent Action Model for Vision-Language-Action Models

    cs.RO 2026-05 unverdicted novelty 7.0

    ALAM creates algebraically consistent latent action transitions from videos to act as auxiliary generative targets, raising robot policy success rates from 47.9% to 85.0% on MetaWorld MT50 and 94.1% to 98.1% on LIBERO.

  3. Learning Visual Feature-Based World Models via Residual Latent Action

    cs.CV 2026-05 unverdicted novelty 7.0

    RLA-WM predicts residual latent actions via flow matching to create visual feature world models that outperform prior feature-based and diffusion approaches while enabling offline video-based robot RL.

  4. VLA-GSE: Boosting Parameter-Efficient Fine-Tuning in VLA with Generalized and Specialized Experts

    cs.RO 2026-05 unverdicted novelty 7.0

    VLA-GSE improves VLA adaptation by initializing generalized shared experts and specialized routed experts via spectral decomposition of the backbone, outperforming full fine-tuning and other PEFT methods on robotic be...

  5. Being-H0.7: A Latent World-Action Model from Egocentric Videos

    cs.RO 2026-04 unverdicted novelty 7.0

    Being-H0.7 adds future-aware latent reasoning to direct VLA policies via dual-branch alignment on latent queries, matching world-model benefits at VLA efficiency.

  6. CF-VLA: Efficient Coarse-to-Fine Action Generation for Vision-Language-Action Policies

    cs.CV 2026-04 unverdicted novelty 7.0

    CF-VLA uses a coarse initialization over endpoint velocity followed by single-step refinement to achieve strong performance with low inference steps on CALVIN, LIBERO, and real-robot tasks.

  7. Mask World Model: Predicting What Matters for Robust Robot Policy Learning

    cs.RO 2026-04 unverdicted novelty 7.0

    Mask World Model predicts semantic mask dynamics with video diffusion and integrates it with a diffusion policy head, outperforming RGB world models on LIBERO and RLBench while showing better real-world generalization...

  8. CUBic: Coordinated Unified Bimanual Perception and Control Framework

    cs.RO 2026-05 unverdicted novelty 6.0

    CUBic learns a shared tokenized representation for bimanual robot perception and control via unidirectional aggregation, bidirectional codebook coordination, and a unified diffusion policy, yielding higher coordinatio...

  9. GuidedVLA: Specifying Task-Relevant Factors via Plug-and-Play Action Attention Specialization

    cs.RO 2026-05 unverdicted novelty 6.0

    GuidedVLA improves VLA success rates by manually supervising separate attention heads in the action decoder with auxiliary signals for task-relevant factors.

  10. ALAM: Algebraically Consistent Latent Action Model for Vision-Language-Action Models

    cs.RO 2026-05 unverdicted novelty 6.0

    ALAM introduces algebraic consistency regularization on latent action transitions from videos, raising VLA success rates from 47.9% to 85.0% on MetaWorld MT50 and 94.1% to 98.1% on LIBERO.

  11. One Token Per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy

    cs.CV 2026-05 unverdicted novelty 6.0

    Reducing visual input to one token per frame via adaptive attention pooling and a unified flow-matching objective improves long-horizon performance in VLA policies on MetaWorld, LIBERO, and real-robot tasks.

  12. One Token Per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy

    cs.CV 2026-05 unverdicted novelty 6.0

    Reducing visual input to one token per frame in world models for vision-language-action policies maintains long-horizon performance while improving success rates on MetaWorld, LIBERO, and real-robot tasks.

  13. Escaping the Diversity Trap in Robotic Manipulation via Anchor-Centric Adaptation

    cs.RO 2026-05 unverdicted novelty 6.0

    Anchor-Centric Adaptation escapes the diversity trap by prioritizing repeated demonstrations at core anchors over broad coverage, yielding higher success rates under fixed data budgets in robotic manipulation.

  14. ConsisVLA-4D: Advancing Spatiotemporal Consistency in Efficient 3D-Perception and 4D-Reasoning for Robotic Manipulation

    cs.RO 2026-05 unverdicted novelty 6.0

    ConsisVLA-4D adds cross-view semantic alignment, cross-object geometric fusion, and cross-scene dynamic reasoning to VLA models, delivering 21.6% and 41.5% gains plus 2.3x and 2.4x speedups on LIBERO and real-world tasks.

  15. From Pixels to Tokens: A Systematic Study of Latent Action Supervision for Vision-Language-Action Models

    cs.RO 2026-05 unverdicted novelty 6.0

    A unified comparison of latent action supervision strategies for VLA models reveals task-specific benefits, with image-based approaches aiding reasoning and generalization, action-based aiding motor control, and discr...

  16. PRTS: A Primitive Reasoning and Tasking System via Contrastive Representations

    cs.AI 2026-04 unverdicted novelty 6.0

    PRTS pretrains VLA models with contrastive goal-conditioned RL to embed goal-reachability probabilities from offline data, yielding SOTA results on robotic benchmarks especially for long-horizon and novel instructions.

  17. GazeVLA: Learning Human Intention for Robotic Manipulation

    cs.RO 2026-04 unverdicted novelty 6.0

    GazeVLA pretrains on large human egocentric datasets to capture gaze-based intention, then finetunes on limited robot data with chain-of-thought reasoning to achieve better robotic manipulation performance than baselines.

  18. CorridorVLA: Explicit Spatial Constraints for Generative Action Heads via Sparse Anchors

    cs.RO 2026-04 unverdicted novelty 6.0

    CorridorVLA improves VLA models by using predicted sparse anchors to impose explicit spatial corridors on action trajectories, yielding 3.4-12.4% success rate gains on LIBERO-Plus with GR00T-Corr reaching 83.21%.

  19. UniT: Toward a Unified Physical Language for Human-to-Humanoid Policy Learning and World Modeling

    cs.RO 2026-04 unverdicted novelty 6.0

    UniT creates a unified physical language via visual anchoring and tri-branch reconstruction to enable scalable human-to-humanoid transfer for policy learning and world modeling.

  20. OFlow: Injecting Object-Aware Temporal Flow Matching for Robust Robotic Manipulation

    cs.RO 2026-04 unverdicted novelty 6.0

    OFlow unifies temporal foresight and object-aware reasoning inside a shared latent space via flow matching to improve VLA robustness in robotic manipulation under distribution shifts.

  21. AnchorRefine: Synergy-Manipulation Based on Trajectory Anchor and Residual Refinement for Vision-Language-Action Models

    cs.RO 2026-04 unverdicted novelty 6.0

    AnchorRefine factorizes VLA action generation into a trajectory anchor for coarse planning and residual refinement for local corrections, improving success rates by up to 7.8% in simulation and 18% on real robots acro...

  22. Robotic Manipulation is Vision-to-Geometry Mapping ($f(v) \rightarrow G$): Vision-Geometry Backbones over Language and Video Models

    cs.RO 2026-04 unverdicted novelty 6.0

    Vision-geometry backbones using pretrained 3D world models outperform vision-language and video models for robotic manipulation by enabling direct mapping from visual input to geometric actions.

  23. Grounded World Model for Semantically Generalizable Planning

    cs.RO 2026-04 conditional novelty 6.0

    A vision-language-aligned world model turns visuomotor MPC into a language-following planner that reaches 87% success on 288 unseen semantic tasks where standard VLAs drop to 22%.

  24. ProGAL-VLA: Grounded Alignment through Prospective Reasoning in Vision-Language-Action Models

    cs.RO 2026-04 unverdicted novelty 6.0

    ProGAL-VLA uses 3D graphs, symbolic sub-goals, and a Grounding Alignment Contrastive loss to ground actions on verified embeddings, raising robustness from 30.3% to 71.5% and ambiguity AUROC to 0.81 on robotic benchmarks.

  25. SIM1: Physics-Aligned Simulator as Zero-Shot Data Scaler in Deformable Worlds

    cs.RO 2026-04 unverdicted novelty 6.0

    SIM1 converts sparse real demonstrations into high-fidelity synthetic data through physics-aligned simulation, yielding policies that match real-data performance at a 1:15 ratio with 90% zero-shot success on deformabl...

  26. EgoVerse: An Egocentric Human Dataset for Robot Learning from Around the World

    cs.RO 2026-04 unverdicted novelty 6.0

    EgoVerse releases 1,362 hours of standardized egocentric human data across 1,965 tasks and shows via multi-lab experiments that robot policy performance scales with human data volume when the data aligns with robot ob...

  27. World Action Models are Zero-shot Policies

    cs.RO 2026-02 unverdicted novelty 6.0

    DreamZero uses a 14B video diffusion model as a World Action Model to achieve over 2x better zero-shot generalization on real robots than state-of-the-art VLAs, real-time 7Hz closed-loop control, and cross-embodiment ...

  28. Cosmos Policy: Fine-Tuning Video Models for Visuomotor Control and Planning

    cs.AI 2026-01 conditional novelty 6.0

    Single-stage fine-tuning of a video model to generate actions as latent frames plus future states and values yields state-of-the-art robot policy performance on LIBERO, RoboCasa, and bimanual tasks.

  29. AttenA+: Rectifying Action Inequality in Robotic Foundation Models

    cs.RO 2026-05 unverdicted novelty 5.0

    AttenA+ applies velocity-driven action attention to reweight training objectives toward kinematically critical low-velocity segments, yielding small benchmark gains on Libero and RoboTwin without added parameters.

  30. VLA-GSE: Boosting Parameter-Efficient Fine-Tuning in VLA with Generalized and Specialized Experts

    cs.RO 2026-05 unverdicted novelty 5.0

    VLA-GSE uses spectral decomposition of the VLA backbone to create generalized and specialized experts, enabling effective robot task adaptation while updating only 2.51% of parameters and achieving 81.2% zero-shot suc...

  31. Can Explicit Physical Feasibility Benefit VLA Learning? An Empirical Study

    cs.LG 2026-04 unverdicted novelty 5.0

    Explicit geometry-based feasibility supervision added to diffusion VLA training leads to better physical reliability, task success, and faster learning with limited data in manipulation tasks.

  32. R3D: Revisiting 3D Policy Learning

    cs.CV 2026-04 unverdicted novelty 5.0

    A transformer 3D encoder plus diffusion decoder architecture, with 3D-specific augmentations, outperforms prior 3D policy methods on manipulation benchmarks by improving training stability.

  33. Motus: A Unified Latent Action World Model

    cs.CV 2025-12 unverdicted novelty 5.0

    Motus unifies understanding, video generation, and action in one latent world model via MoT experts and optical-flow latent actions, reporting gains over prior methods in simulation and real robots.

Reference graph

Works this paper leans on

105 extracted references · 105 canonical work pages · cited by 30 Pith papers · 4 internal anchors

  1. [1]

    AgiBot World Colosseo: A Large-scale Manipulation Platform for Scalable and Intelligent Embodied Systems

    AgiBot-World-Contributors, Qingwen Bu, Jisong Cai, Li Chen, Xiuqi Cui, Yan Ding, Siyuan Feng, Shenyuan Gao, Xindong He, Xuan Hu, Xu Huang, Shu Jiang, Yuxin Jiang, Cheng Jing, Hongyang Li, Jialu Li, Chiming Liu, Yi Liu, Yuxiang Lu, Jianlan Luo, Ping Luo, Yao Mu, Yuehan Niu, Yixuan Pan, Jiangmiao Pang, Yu Qiao, Guanghui Ren, Cheng Ruan, Jiaqi Shan, Yongjian...

  2. [2]

    Deep variational information bottleneck

    Alexander A Alemi, Ian Fischer, Joshua V Dillon, and Kevin Murphy. Deep variational information bottleneck. In ICLR, 2017.

  3. [3]

    Laser: Learning a latent action space for efficient reinforcement learning

    Arthur Allshire, Roberto Martín-Martín, Charles Lin, Shawn Manuel, Silvio Savarese, and Animesh Garg. Laser: Learning a latent action space for efficient reinforcement learning. In ICRA, 2021.

  4. [4]

    Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments

    Peter Anderson, Qi Wu, Damien Teney, Jake Bruce, Mark Johnson, Niko Sünderhauf, Ian Reid, Stephen Gould, and Anton Van Den Hengel. Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments. In CVPR, 2018.

  5. [5]

    Self-supervised learning from images with a joint-embedding predictive architecture

    Mahmoud Assran, Quentin Duval, Ishan Misra, Piotr Bojanowski, Pascal Vincent, Michael Rabbat, Yann LeCun, and Nicolas Ballas. Self-supervised learning from images with a joint-embedding predictive architecture. In CVPR, 2023.

  6. [6]

    Revisiting Feature Prediction for Learning Visual Representations from Video

    Adrien Bardes, Quentin Garrido, Jean Ponce, Xinlei Chen, Michael Rabbat, Yann LeCun, Mahmoud Assran, and Nicolas Ballas. Revisiting feature prediction for learning visual representations from video. arXiv preprint arXiv:2404.08471, 2024.

  7. [7]

    HYDRA: Hybrid robot actions for imitation learning

    Suneel Belkhale, Yuchen Cui, and Dorsa Sadigh. HYDRA: Hybrid robot actions for imitation learning. arXiv preprint arXiv:2306.17237, 2023.

  8. [8]

    Towards generalizable zero-shot manipulation via translating human interaction plans

    Homanga Bharadhwaj, Abhinav Gupta, Vikash Kumar, and Shubham Tulsiani. Towards generalizable zero-shot manipulation via translating human interaction plans. In ICRA, 2024.

  9. [9]

    Zero-shot robotic manipulation with pre-trained image-editing diffusion models

    Kevin Black, Mitsuhiko Nakamoto, Pranav Atreya, Homer Rich Walke, Chelsea Finn, Aviral Kumar, and Sergey Levine. Zero-shot robotic manipulation with pre-trained image-editing diffusion models. In ICLR, 2024.

  10. [10]

    RT-2: Vision-language-action models transfer web knowledge to robotic control

    Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof Choromanski, Tianli Ding, Danny Driess, Avinava Dubey, Chelsea Finn, et al. RT-2: Vision-language-action models transfer web knowledge to robotic control. In CoRL, 2023.

  11. [11]

    RT-1: Robotics transformer for real-world control at scale

    Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, et al. RT-1: Robotics transformer for real-world control at scale. In RSS, 2023.

  12. [12]

    Genie: Generative interactive environments

    Jake Bruce, Michael D Dennis, Ashley Edwards, Jack Parker-Holder, Yuge Shi, Edward Hughes, Matthew Lai, Aditi Mavalankar, Richie Steigerwald, Chris Apps, et al. Genie: Generative interactive environments. In ICML, 2024.

  13. [13]

    Towards synergistic, generalized, and efficient dual-system for robotic manipulation

    Qingwen Bu, Hongyang Li, Li Chen, Jisong Cai, Jia Zeng, Heming Cui, Maoqing Yao, and Yu Qiao. Towards synergistic, generalized, and efficient dual-system for robotic manipulation. arXiv preprint arXiv:2410.08001, 2024.

  14. [14]

    Closed-loop visuomotor control with generative expectation for robotic manipulation

    Qingwen Bu, Jia Zeng, Li Chen, Yanchao Yang, Guyue Zhou, Junchi Yan, Ping Luo, Heming Cui, Yi Ma, and Hongyang Li. Closed-loop visuomotor control with generative expectation for robotic manipulation. In NeurIPS, 2024.

  15. [15]

    Berkeley UR5 demonstration dataset

    Lawrence Yunliang Chen, Simeon Adebola, and Ken Goldberg. Berkeley UR5 demonstration dataset. https://sites.google.com/view/berkeley-ur5/home.

  16. [16]

    IGOR: Image-goal representations are the atomic control units for foundation models in embodied AI

    Xiaoyu Chen, Junliang Guo, Tianyu He, Chuheng Zhang, Pushi Zhang, Derek Cathera Yang, Li Zhao, and Jiang Bian. IGOR: Image-goal representations are the atomic control units for foundation models in embodied AI. arXiv preprint arXiv:2411.00785, 2024.

  17. [17]

    Diffusion Policy: Visuomotor policy learning via action diffusion

    Cheng Chi, Siyuan Feng, Yilun Du, Zhenjia Xu, Eric Cousineau, Benjamin Burchfiel, and Shuran Song. Diffusion Policy: Visuomotor policy learning via action diffusion. In RSS, 2023.

  18. [18]

    Unsupervised cross-lingual representation learning at scale

    Alexis Conneau et al. Unsupervised cross-lingual representation learning at scale. arXiv preprint arXiv:1911.02116, 2019.

  19. [19]

    From play to policy: Conditional behavior generation from uncurated robot data

    Zichen Jeff Cui, Yibin Wang, Nur Muhammad Mahi Shafiullah, and Lerrel Pinto. From play to policy: Conditional behavior generation from uncurated robot data. arXiv preprint arXiv:2210.10047, 2022.

  20. [20]

    DynaMo: In-domain dynamics pretraining for visuo-motor control

    Zichen Jeff Cui, Hengkai Pan, Aadhithya Iyer, Siddhant Haldar, and Lerrel Pinto. DynaMo: In-domain dynamics pretraining for visuo-motor control. In NeurIPS, 2024.

  21. [21]

    Shivin Dass, Jullian Yapeter, Jesse Zhang, Jiahui Zhang, Karl Pertsch, Stefanos Nikolaidis, and Joseph J. Lim. CLVR jaco play dataset, 2023. URL https://github.com/clvrai/clvr_jaco_play_dataset.

  22. [22]

    BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

    Jacob Devlin. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.

  23. [23]

    Scaling Cross-Embodied Learning: One policy for manipulation, navigation, locomotion and aviation

    Ria Doshi, Homer Walke, Oier Mees, Sudeep Dasari, and Sergey Levine. Scaling Cross-Embodied Learning: One policy for manipulation, navigation, locomotion and aviation. In CoRL, 2024.

  24. [24]

    Learning universal policies via text-guided video generation

    Yilun Du, Sherry Yang, Bo Dai, Hanjun Dai, Ofir Nachum, Josh Tenenbaum, Dale Schuurmans, and Pieter Abbeel. Learning universal policies via text-guided video generation. In NeurIPS, 2024.

  25. [25]

    Video language planning

    Yilun Du, Sherry Yang, Pete Florence, Fei Xia, Ayzaan Wahid, brian ichter, Pierre Sermanet, Tianhe Yu, Pieter Abbeel, Joshua B. Tenenbaum, Leslie Pack Kaelbling, Andy Zeng, and Jonathan Tompson. Video language planning. In ICLR, 2024.

  26. [26]

    FLIP: Flow-centric generative planning for general-purpose manipulation tasks

    Chongkai Gao, Haozhuo Zhang, Zhixuan Xu, Zhehao Cai, and Lin Shao. FLIP: Flow-centric generative planning for general-purpose manipulation tasks. In ICLR, 2025.

  27. [27]

    AdaWorld: Learning adaptable world models with latent actions

    Shenyuan Gao, Siyuan Zhou, Yilun Du, Jun Zhang, and Chuang Gan. AdaWorld: Learning adaptable world models with latent actions. arXiv preprint arXiv:2503.18938, 2025.

  28. [28]

    Octo: An open-source generalist robot policy

    Dibya Ghosh, Homer Walke, Karl Pertsch, Kevin Black, Oier Mees, Sudeep Dasari, Joey Hejna, Tobias Kreiman, Charles Xu, Jianlan Luo, et al. Octo: An open-source generalist robot policy. In RSS, 2024.

  29. [29]

    Ego4D: Around the world in 3,000 hours of egocentric video

    Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, et al. Ego4D: Around the world in 3,000 hours of egocentric video. In CVPR, 2022.

  30. [30]

    Learning latent dynamics for planning from pixels

    Danijar Hafner, Timothy Lillicrap, Ian Fischer, Ruben Villegas, David Ha, Honglak Lee, and James Davidson. Learning latent dynamics for planning from pixels. In ICML, 2019.

  31. [31]

    Minho Heo, Youngwoon Lee, Doohyun Lee, and Joseph J. Lim. FurnitureBench: Reproducible real-world benchmark for long-horizon complex manipulation. In RSS, 2023.

  32. [32]

    SPOT: SE(3) pose trajectory diffusion for object-centric manipulation

    Cheng-Chun Hsu, Bowen Wen, Jie Xu, Yashraj Narang, Xiaolong Wang, Yuke Zhu, Joydeep Biswas, and Stan Birchfield. SPOT: SE(3) pose trajectory diffusion for object-centric manipulation. arXiv preprint arXiv:2411.00965, 2024.

  33. [33]

    LoRA: Low-rank adaptation of large language models

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In ICLR, 2022.

  34. [34]

    BC-Z: Zero-shot task generalization with robotic imitation learning

    Eric Jang, Alex Irpan, Mohi Khansari, Daniel Kappler, Frederik Ebert, Corey Lynch, Sergey Levine, and Chelsea Finn. BC-Z: Zero-shot task generalization with robotic imitation learning. In CoRL, 2022.

  35. [35]

    MaIL: Improving imitation learning with selective state space models

    Xiaogang Jia, Qian Wang, Atalay Donat, Bowen Xing, Ge Li, Hongyi Zhou, Onur Celik, Denis Blessing, Rudolf Lioutikov, and Gerhard Neumann. MaIL: Improving imitation learning with selective state space models. In CoRL, 2024.

  36. [36]

    Scalable deep reinforcement learning for vision-based robotic manipulation

    Dmitry Kalashnikov, Alex Irpan, Peter Pastor, Julian Ibarz, Alexander Herzog, Eric Jang, Deirdre Quillen, Ethan Holly, Mrinal Kalakrishnan, Vincent Vanhoucke, et al. Scalable deep reinforcement learning for vision-based robotic manipulation. In CoRL, 2018.

  37. [37]

    Prismatic VLMs: Investigating the design space of visually-conditioned language models

    Siddharth Karamcheti, Suraj Nair, Ashwin Balakrishna, Percy Liang, Thomas Kollar, and Dorsa Sadigh. Prismatic VLMs: Investigating the design space of visually-conditioned language models. In ICML, 2024.

  38. [38]

    DROID: A large-scale in-the-wild robot manipulation dataset

    Alexander Khazatsky, Karl Pertsch, Suraj Nair, Ashwin Balakrishna, Sudeep Dasari, Siddharth Karamcheti, Soroush Nasiriany, Mohan Kumar Srirama, Lawrence Yunliang Chen, Kirsty Ellis, et al. DROID: A large-scale in-the-wild robot manipulation dataset. In RSS, 2024.

  39. [39]

    OpenVLA: An open-source vision-language-action model

    Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. OpenVLA: An open-source vision-language-action model. In CoRL, 2024.

  40. [40]

    Beyond the nav-graph: Vision-and-language navigation in continuous environments

    Jacob Krantz, Erik Wijmans, Arjun Majumdar, Dhruv Batra, and Stefan Lee. Beyond the nav-graph: Vision-and-language navigation in continuous environments. In ECCV, 2020.

  41. [41]

    Beyond the nav-graph: Vision and language navigation in continuous environments

    Jacob Krantz, Erik Wijmans, Arjun Majumdar, Dhruv Batra, and Stefan Lee. Beyond the nav-graph: Vision and language navigation in continuous environments. In ECCV, 2020.

  42. [42]

    Partially observable Markov decision processes (POMDPs) and robotics

    Hanna Kurniawati. Partially observable Markov decision processes (POMDPs) and robotics. arXiv preprint arXiv:2107.07599, 2021.

  43. [43]

    Set transformer: A framework for attention-based permutation-invariant neural networks

    Juho Lee, Yoonho Lee, Jungtaek Kim, Adam Kosiorek, Seungjin Choi, and Yee Whye Teh. Set transformer: A framework for attention-based permutation-invariant neural networks. In ICML, 2019.

  44. [44]

    Behavior generation with latent actions

    Seungjae Lee, Yibin Wang, Haritheja Etukuru, H Jin Kim, Nur Muhammad Mahi Shafiullah, and Lerrel Pinto. Behavior generation with latent actions. In ICML, 2024.

  45. [45]

    Towards generalist robot policies: What matters in building vision-language-action models

    Xinghang Li, Peiyan Li, Minghuan Liu, Dong Wang, Jirong Liu, Bingyi Kang, Xiao Ma, Tao Kong, Hanbo Zhang, and Huaping Liu. Towards generalist robot policies: What matters in building vision-language-action models. arXiv preprint arXiv:2412.14058, 2024.

  46. [46]

    Vision-language foundation models as effective robot imitators

    Xinghang Li, Minghuan Liu, Hanbo Zhang, Cunjun Yu, Jie Xu, Hongtao Wu, Chilam Cheang, Ya Jing, Weinan Zhang, Huaping Liu, Hang Li, and Tao Kong. Vision-language foundation models as effective robot imitators. In ICLR, 2024.

  47. [47]

    Evaluating real-world robot manipulation policies in simulation

    Xuanlin Li, Kyle Hsu, Jiayuan Gu, Karl Pertsch, Oier Mees, Homer Rich Walke, Chuyuan Fu, Ishikaa Lunawat, Isabel Sieh, Sean Kirmani, et al. Evaluating real-world robot manipulation policies in simulation. In CoRL, 2024.

  48. [48]

    LIBERO: Benchmarking knowledge transfer for lifelong robot learning

    Bo Liu, Yifeng Zhu, Chongkai Gao, Yihao Feng, Qiang Liu, Yuke Zhu, and Peter Stone. LIBERO: Benchmarking knowledge transfer for lifelong robot learning. In NeurIPS, 2024.

  49. [49]

    Improved baselines with visual instruction tuning

    Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In CVPR, 2023.

  50. [50]

    Robot learning on the job: Human-in-the-loop autonomy and learning during deployment

    Huihan Liu, Soroush Nasiriany, Lance Zhang, Zhiyao Bao, and Yuke Zhu. Robot learning on the job: Human-in-the-loop autonomy and learning during deployment. In RSS, 2023.

  51. [51]

    Multi-stage cable routing through hierarchical imitation learning

    Jianlan Luo, Charles Xu, Xinyang Geng, Gilbert Feng, Kuan Fang, Liam Tan, Stefan Schaal, and Sergey Levine. Multi-stage cable routing through hierarchical imitation learning. TRO, 2023.

  52. [52]

    FMB: A functional manipulation benchmark for generalizable robotic learning

    Jianlan Luo, Charles Xu, Fangchen Liu, Liam Tan, Zipeng Lin, Jeffrey Wu, Pieter Abbeel, and Sergey Levine. FMB: A functional manipulation benchmark for generalizable robotic learning. IJRR, 2023.

  53. [53]

    Interactive language: Talking to robots in real time

    Corey Lynch, Ayzaan Wahid, Jonathan Tompson, Tianli Ding, James Betker, Robert Baruch, Travis Armstrong, and Pete Florence. Interactive language: Talking to robots in real time. RA-L, 2023.

  54. [54]

    Vision language models are in-context value learners

    Yecheng Jason Ma, Joey Hejna, Ayzaan Wahid, Chuyuan Fu, Dhruv Shah, Jacky Liang, Zhuo Xu, Sean Kirmani, Peng Xu, Danny Driess, et al. Vision language models are in-context value learners. In ICLR, 2025.

  55. [55]

    RoboTurk: A crowdsourcing platform for robotic skill learning through imitation

    Ajay Mandlekar, Yuke Zhu, Animesh Garg, Jonathan Booher, Max Spero, Albert Tung, Julian Gao, John Emmons, Anchit Gupta, Emre Orbay, Silvio Savarese, and Li Fei-Fei. RoboTurk: A crowdsourcing platform for robotic skill learning through imitation. In CoRL, 2018.

  56. [56]

    CALVIN: A benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks

    Oier Mees, Lukas Hermann, Erick Rosete-Beas, and Wolfram Burgard. CALVIN: A benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. RA-L, 2022.

  57. [57]

    Grounding language with visual affordances over unstructured data

    Oier Mees, Jessica Borja-Diaz, and Wolfram Burgard. Grounding language with visual affordances over unstructured data. In ICRA, 2023.

  58. [58]

    Structured world models from human videos

    Russell Mendonca, Shikhar Bahl, and Deepak Pathak. Structured world models from human videos. In CoRL, 2023.

  59. [59]

    Quest: Self-supervised skill abstractions for learning continuous control

    Atharva Mete, Haotian Xue, Albert Wilcox, Yongxin Chen, and Animesh Garg. Quest: Self-supervised skill abstractions for learning continuous control. In NeurIPS, 2024.

  60. [60]

    Learning finite-state controllers for partially observable environments

    Nicolas Meuleau, Leonid Peshkin, Kee-Eung Kim, and Leslie Pack Kaelbling. Learning finite-state controllers for partially observable environments. arXiv preprint arXiv:1301.6721, 2013.

  61. [61]

    Learning and retrieval from prior data for skill-based imitation learning

    Soroush Nasiriany, Tian Gao, Ajay Mandlekar, and Yuke Zhu. Learning and retrieval from prior data for skill-based imitation learning. In CoRL, 2022.

  62. [62]

    DINOv2: Learning robust visual features without supervision

    Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. DINOv2: Learning robust visual features without supervision. TMLR, 2024.

  63. [63]

    Open X-Embodiment: Robotic learning datasets and RT-X models

    Abhishek Padalkar, Acorn Pooley, Ajinkya Jain, Alex Bewley, Alex Herzog, Alex Irpan, Alexander Khazatsky, Anant Rai, Anikait Singh, Anthony Brohan, et al. Open X-Embodiment: Robotic learning datasets and RT-X models. In ICRA, 2024.

  64. [64]

    Variational autoencoder for deep learning of images, labels and captions

    Yunchen Pu, Zhe Gan, Ricardo Henao, Xin Yuan, Chunyuan Li, Andrew Stevens, and Lawrence Carin. Variational autoencoder for deep learning of images, labels and captions. In NeurIPS, 2016.

  65. [65]

    Shared Control Templates for Assistive Robotics

    Gabriel Quere, Annette Hagengruber, Maged Iskandar, Samuel Bustamante, Daniel Leidner, Freek Stulp, and Joern Vogel. Shared Control Templates for Assistive Robotics. In ICRA, Paris, France, 2020.

  66. [66]

    Improving language understanding by generative pre-training

    Alec Radford. Improving language understanding by generative pre-training. 2018.

  67. [67]

    Exploring the limits of transfer learning with a unified text-to-text transformer

    Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. JMLR, 2020.

  68. [68]

    Multimodal Diffusion Transformer: Learning versatile behavior from multimodal goals

    Moritz Reuss, Ömer Erdinç Yağmurlu, Fabian Wenzel, and Rudolf Lioutikov. Multimodal Diffusion Transformer: Learning versatile behavior from multimodal goals. In ICRA Workshops, 2024.

  69. [69]

    Multi-resolution sensing for real-time control with Vision-Language Models

    Saumya Saxena, Mohit Sharma, and Oliver Kroemer. Multi-resolution sensing for real-time control with Vision-Language Models. In CoRL, 2023.

  70. [70]

    Learning to act without actions

    Dominik Schmidt and Minqi Jiang. Learning to act without actions. In ICLR, 2024.

  71. [71]

    On bringing robots home

    Nur Muhammad Mahi Shafiullah, Anant Rai, Haritheja Etukuru, Yiqian Liu, Ishan Misra, Soumith Chintala, and Lerrel Pinto. On bringing robots home. arXiv preprint arXiv:2311.16098, 2023.

  72. [72]

    GNM: A general navigation model to drive any robot

    Dhruv Shah, Ajay Sridhar, Arjun Bhorkar, Noriaki Hirose, and Sergey Levine. GNM: A general navigation model to drive any robot. In ICRA, 2023.

  73. [73]

    MUTEX: Learning unified policies from multimodal task specifications

    Rutav Shah, Roberto Martín-Martín, and Yuke Zhu. MUTEX: Learning unified policies from multimodal task specifications. In CoRL, 2023.

  74. [74]

    Grounding multimodal large language models in actions

    Andrew Szot, Bogdan Mazoure, Harsh Agrawal, Devon Hjelm, Zsolt Kira, and Alexander Toshev. Grounding multimodal large language models in actions. In NeurIPS, 2024.

  75. [75]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.

  76. [76]

    Neural discrete representation learning

    Aaron Van Den Oord, Oriol Vinyals, and Koray Kavukcuoglu. Neural discrete representation learning. In NeurIPS, 2017.

  77. [77]

    Phenaki: Variable length video generation from open domain textual descriptions

    Ruben Villegas, Mohammad Babaeizadeh, Pieter-Jan Kindermans, Hernan Moraldo, Han Zhang, Mohammad Taghi Saffar, Santiago Castro, Julius Kunze, and Dumitru Erhan. Phenaki: Variable length video generation from open domain textual descriptions. In ICLR, 2023.

  78. [78]

    BridgeData v2: A dataset for robot learning at scale

    Homer Rich Walke, Kevin Black, Tony Z Zhao, Quan Vuong, Chongyi Zheng, Philippe Hansen-Estruch, Andre Wang He, Vivek Myers, Moo Jin Kim, Max Du, et al. BridgeData v2: A dataset for robot learning at scale. In CoRL, 2023.

  79. [79]

    OmniJARVIS: Unified vision-language-action tokenization enables open-world instruction following agents

    Zihao Wang, Shaofei Cai, Zhancun Mu, Haowei Lin, Ceyao Zhang, Xuejie Liu, Qing Li, Anji Liu, Xiaojian Ma, and Yitao Liang. OmniJARVIS: Unified vision-language-action tokenization enables open-world instruction following agents. In NeurIPS, 2024.

  80. [80]

    Chain-of-thought prompting elicits reasoning in large language models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. In NeurIPS, 2022.

Showing first 80 references.