XR-1: Towards Versatile Vision-Language-Action Models via Learning Unified Vision-Motion Representations

Di Wu; Fei Liao; Jian Tang; Kun Wu; Meng Li; Min Wan; Ning Liu; Qingjie Liu; Shanghang Zhang; Shichao Fan

arxiv: 2511.02776 · v2 · pith:GFIZ5KWInew · submitted 2025-11-04 · 💻 cs.RO

XR-1: Towards Versatile Vision-Language-Action Models via Learning Unified Vision-Motion Representations

Shichao Fan , Kun Wu , Zhengping Che , Xinhua Wang , Di Wu , Fei Liao , Ning Liu , Yixue Zhang

show 7 more authors

Zhen Zhao Zhiyuan Xu Meng Li Qingjie Liu Shanghang Zhang Min Wan Jian Tang

This is my paper

Pith reviewed 2026-05-18 01:00 UTC · model grok-4.3

classification 💻 cs.RO

keywords vision-language-action modelsunified vision-motion codesdual-branch VQ-VAEcross-embodiment learningrobotic manipulationmultimodal representationgeneralization to novel objects

0 comments

The pith

A shared discrete code for vision and motion lets one model control many different robots and tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents XR-1 as a way to build vision-language-action models that work across many robot types by first creating a single intermediate representation of both what the robot sees and how it moves. This representation comes from training a dual-branch VQ-VAE on mixed data that includes different robot bodies and human demonstrations, so the codes capture patterns common to both visual changes and actual motions. The codes then support a three-stage process of self-supervised learning, large-scale pretraining, and task-specific adaptation. If the approach holds, it reduces the usual need to collect and train on embodiment-specific data for every new robot or environment.

Core claim

XR-1 shows that Unified Vision-Motion Codes learned by a dual-branch VQ-VAE that jointly encodes visual dynamics and robotic motion can serve as an effective bridge between high-dimensional observations and low-level actions. The codes align multimodal information from heterogeneous sources such as varied robot embodiments and human demonstrations, allowing the model to produce precise actions while transferring knowledge without embodiment-specific fine-tuning of the codebook itself.

What carries the argument

Unified Vision-Motion Codes (UVMC) from a dual-branch VQ-VAE that encodes visual dynamics in one branch and robotic motion in the other before mapping both into a shared discrete latent space.

If this is right

Large-scale pretraining can combine data from many robots and human sources into one model without per-embodiment codebooks.
The resulting policies show improved success on manipulation tasks when objects, backgrounds, distractors, or lighting change from training conditions.
One model achieves higher performance than prior VLA systems across six robot embodiments and more than 120 tasks in over 14,000 real-world rollouts.
Task-specific adaptation requires only the final post-training stage once the shared codes are learned.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The joint-encoding step could lower data collection needs in other embodied settings where hardware differences create similar observation-action gaps.
Freezing the learned codes during policy training and measuring any drop in transfer performance would test how much embodiment information remains outside the codes.
Applying the same dual-branch structure to additional signals such as language instructions or force feedback might expand the range of tasks that transfer without extra fine-tuning.

Load-bearing premise

The discrete codes from the dual-branch VQ-VAE capture complementary multimodal knowledge that transfers across different robot embodiments and human demonstrations without needing to retrain the codebook for each new body.

What would settle it

Train and evaluate the full XR-1 pipeline but replace the joint dual-branch VQ-VAE with two separate independent VQ-VAEs for vision and motion only; if real-world task success rates stay the same or improve, the claimed benefit of joint encoding does not hold.

Figures

Figures reproduced from arXiv: 2511.02776 by Di Wu, Fei Liao, Jian Tang, Kun Wu, Meng Li, Min Wan, Ning Liu, Qingjie Liu, Shanghang Zhang, Shichao Fan, Xinhua Wang, Yixue Zhang, Zhengping Che, Zhen Zhao, Zhiyuan Xu.

**Figure 1.** Figure 1: We introduce X Robotic Model 1 (XR-1), a versatile and scalable vision-language-action framework. XR-1 supports robust multi-task learning across diverse robot embodiments and environments. ABSTRACT Recent progress in large-scale robotic datasets and vision-language models (VLMs) has advanced research on vision-language-action (VLA) models. However, existing VLA models still face two fundamental challenges… view at source ↗

**Figure 2.** Figure 2: Overview of X Robotic Model 1 (XR-1). In XR-1, we introduce the Unified Vision-Motion Codes (UVMC), a discrete latent representation that jointly encodes visual dynamics and robotic motion. XR-1 adopts a three-stage training paradigm to enable precise low-level control across diverse robots and tasks. and ct+h at time steps t and t + h, the visual encoder Evis(·) extracts a latent variable zvis: zvis = Evi… view at source ↗

**Figure 3.** Figure 3: Experimental Setup. We evaluate XR-1 across six robot embodiments (Tien Kung 1.0/2.0, Single-/Dual-Arm [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗

**Figure 4.** Figure 4: Success rate results across 20 tasks on Dual-Arm UR-5e. [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗

**Figure 5.** Figure 5: Out-of-box evaluation results of 7 tasks on Dual-Arm Franka. [PITH_FULL_IMAGE:figures/full_fig_p009_5.png] view at source ↗

**Figure 6.** Figure 6: Fast adaptation on Tien Kung 2.0. Tien Kung 2.0 is an unseen embodiment in XR-D. In this setup, XR-1 [PITH_FULL_IMAGE:figures/full_fig_p009_6.png] view at source ↗

**Figure 7.** Figure 7: Unseen scenario task setup on Dual-Arm Franka. [PITH_FULL_IMAGE:figures/full_fig_p010_7.png] view at source ↗

**Figure 8.** Figure 8: Overview of the pretraining datasets used for XR-1. We combine Open-X, RoboMIND, Ego4D, and our [PITH_FULL_IMAGE:figures/full_fig_p021_8.png] view at source ↗

**Figure 9.** Figure 9: Out-of-box evaluation results of 7 tasks on Dual-Arm UR-5e. [PITH_FULL_IMAGE:figures/full_fig_p025_9.png] view at source ↗

**Figure 10.** Figure 10: Fast adaption on Dual-Arm UR5e. Dual-Arm UR5e is an embodiment included in XR-D. In this setup, [PITH_FULL_IMAGE:figures/full_fig_p025_10.png] view at source ↗

**Figure 11.** Figure 11: Diverse task settings in evaluation: bimanual collaboration, dexterous manipulation, deformable object [PITH_FULL_IMAGE:figures/full_fig_p026_11.png] view at source ↗

read the original abstract

Recent progress in large-scale robotic datasets and vision-language models (VLMs) has advanced research on vision-language-action (VLA) models. However, existing VLA models still face two fundamental challenges: (i) producing precise low-level actions from high-dimensional observations, (ii) bridging domain gaps across heterogeneous data sources, including diverse robot embodiments and human demonstrations. Existing methods often encode latent variables from either visual dynamics or robotic actions to guide policy learning, but they fail to fully exploit the complementary multi-modal knowledge present in large-scale, heterogeneous datasets. In this work, we present X Robotic Model 1 (XR-1), a novel framework for versatile and scalable VLA learning across diverse robots, tasks, and environments. XR-1 introduces the \emph{Unified Vision-Motion Codes (UVMC)}, a discrete latent representation learned via a dual-branch VQ-VAE that jointly encodes visual dynamics and robotic motion. UVMC addresses these challenges by (i) serving as an intermediate representation between the observations and actions, and (ii) aligning multimodal dynamic information from heterogeneous data sources to capture complementary knowledge. To effectively exploit UVMC, we propose a three-stage training paradigm: (i) self-supervised UVMC learning, (ii) UVMC-guided pretraining on large-scale cross-embodiment robotic datasets, and (iii) task-specific post-training. We validate XR-1 through extensive real-world experiments with more than 14,000 rollouts on six different robot embodiments, spanning over 120 diverse manipulation tasks. XR-1 consistently outperforms state-of-the-art baselines such as $\pi_{0.5}$, $\pi_0$, RDT, UniVLA, and GR00T-N1.5 while demonstrating strong generalization to novel objects, background variations, distractors, and illumination changes. Our project is at https://xr-1-vla.github.io/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

XR-1 delivers a large real-robot evaluation of joint vision-motion discrete codes but leaves the codebook adaptation question open, which undercuts the cross-embodiment unification story.

read the letter

The main thing to know is that XR-1 delivers a large set of real-robot results using a shared discrete representation for vision and motion, but the details needed to understand why it generalizes across embodiments are not fully spelled out. The new piece is the dual-branch VQ-VAE that jointly discretizes visual dynamics and robot actions into Unified Vision-Motion Codes. These codes then guide the three-stage training: self-supervised code learning, then pretraining on heterogeneous robot and human data, and finally task-specific tuning. The authors report consistent outperformance over baselines like π0.5 and GR00T-N1.5 on held-out tasks with novel objects and lighting changes. The evaluation stands out for its scale. Over 14,000 physical rollouts across six different robot embodiments and 120 manipulation tasks is substantial evidence that the method works in practice. That kind of testing is what moves these models closer to useful deployment. The soft spots are in the supporting analysis. There are no ablations shown for codebook size, commitment loss, or data mixture ratios, and the description leaves open whether the codebook remains fixed after the first stage or continues to update during cross-embodiment pretraining. If the codes adapt to specific embodiments in stage two, the unification benefit without extra fine-tuning does not fully hold up. This is a real but not fatal gap because the performance claims rest on the empirical numbers rather than on a closed theoretical argument. This paper is aimed at people building generalist manipulation policies who want to leverage mixed human and robot datasets. Readers working on discrete latent models for robotics will find the approach familiar but the scale of validation useful. It is coherent and grounded enough to deserve a serious referee. I would recommend sending it to peer review with requests for those ablations and a clear statement on codebook handling.

Referee Report

2 major / 2 minor

Summary. The paper introduces XR-1, a vision-language-action model that learns Unified Vision-Motion Codes (UVMC) via a dual-branch VQ-VAE to jointly encode visual dynamics and robotic motion. These discrete codes act as an intermediate representation bridging high-dimensional observations to low-level actions while aligning complementary multimodal knowledge across heterogeneous sources including diverse robot embodiments and human demonstrations. A three-stage training paradigm is proposed: (i) self-supervised UVMC learning, (ii) UVMC-guided pretraining on large-scale cross-embodiment datasets, and (iii) task-specific post-training. The approach is validated through more than 14,000 real-world rollouts on six robot embodiments spanning over 120 manipulation tasks, claiming consistent outperformance over baselines such as π_{0.5}, π_0, RDT, UniVLA, and GR00T-N1.5 together with strong generalization to novel objects, backgrounds, distractors, and illumination changes.

Significance. If the central claims hold, XR-1 would mark a meaningful step toward scalable cross-embodiment VLA models by supplying a unified discrete latent space that exploits complementary vision-motion knowledge without embodiment-specific codebook fine-tuning. The scale of the real-world evaluation (more than 14,000 rollouts across six embodiments) constitutes a clear empirical strength that exceeds typical VLA reporting and lends weight to the generalization results. However, the absence of quantitative controls on the VQ-VAE design choices limits the ability to isolate the precise contribution of UVMC to the reported gains.

major comments (2)

[Three-stage training paradigm] Three-stage training paradigm: The manuscript does not state whether the codebook produced by the dual-branch VQ-VAE in stage (i) remains frozen or continues to be updated during the UVMC-guided pretraining on mixed robot and human data in stage (ii). This detail is load-bearing for the unification claim; if the codebook receives embodiment-specific updates, the explanation for why XR-1 transfers without per-embodiment fine-tuning and outperforms baselines such as π_{0.5} and GR00T-N1.5 on novel objects and lighting is weakened.
[Experimental evaluation] Experimental results: The reported outperformance and generalization rest on more than 14,000 rollouts, yet no quantitative ablations, error bars, or sensitivity analysis are supplied for the codebook size, commitment loss weight, or cross-embodiment alignment losses. Without these controls it is difficult to attribute performance gains specifically to the dual-branch VQ-VAE rather than to data scale or other unablated factors.

minor comments (2)

[Abstract] Abstract: The phrasing 'X Robotic Model 1 (XR-1)' is slightly inconsistent with the title and could be standardized for clarity.
[Methods] Methods: Explicit values or ranges for the data mixture ratios and the number of training stages would improve reproducibility of the three-stage paradigm.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the scale of our real-world evaluation across more than 14,000 rollouts. We address each major comment below with clarifications on our methodology and plans for revision.

read point-by-point responses

Referee: [Three-stage training paradigm] Three-stage training paradigm: The manuscript does not state whether the codebook produced by the dual-branch VQ-VAE in stage (i) remains frozen or continues to be updated during the UVMC-guided pretraining on mixed robot and human data in stage (ii). This detail is load-bearing for the unification claim; if the codebook receives embodiment-specific updates, the explanation for why XR-1 transfers without per-embodiment fine-tuning and outperforms baselines such as π_{0.5} and GR00T-N1.5 on novel objects and lighting is weakened.

Authors: We appreciate this observation. In our framework, the codebook learned via the dual-branch VQ-VAE in stage (i) is kept frozen throughout stage (ii). This design choice ensures that UVMC serves as a fixed, unified discrete representation that aligns complementary vision-motion knowledge across heterogeneous sources (diverse robot embodiments and human demonstrations) without any embodiment-specific updates to the codebook. Freezing the codebook is central to enabling cross-embodiment transfer and the observed generalization to novel objects, backgrounds, and lighting conditions. We will revise the manuscript to explicitly describe this freezing step in the three-stage training section and in the method overview figure caption. revision: yes
Referee: [Experimental evaluation] Experimental results: The reported outperformance and generalization rest on more than 14,000 rollouts, yet no quantitative ablations, error bars, or sensitivity analysis are supplied for the codebook size, commitment loss weight, or cross-embodiment alignment losses. Without these controls it is difficult to attribute performance gains specifically to the dual-branch VQ-VAE rather than to data scale or other unablated factors.

Authors: We agree that additional quantitative controls would better isolate the contribution of the dual-branch VQ-VAE and UVMC. Our current results emphasize large-scale real-world validation and direct comparisons to strong baselines, but we did not report sensitivity analyses for codebook size, commitment loss weight, or alignment loss coefficients in the main text. In the revised manuscript we will add a dedicated ablation subsection (including tables) that varies codebook sizes (e.g., 512/1024/2048), commitment loss weights, and cross-embodiment alignment loss coefficients, reporting success rates with error bars computed over multiple random seeds where feasible. This will strengthen attribution of gains to the proposed UVMC design. revision: yes

Circularity Check

0 steps flagged

No significant circularity: empirical results on held-out rollouts are independent of training definitions

full rationale

The paper presents a three-stage training procedure (self-supervised UVMC learning via dual-branch VQ-VAE, followed by UVMC-guided pretraining on heterogeneous data, then task-specific post-training) whose value is measured by external real-world metrics: success rates over 14,000 rollouts on six embodiments and 120 tasks, plus generalization to novel objects/lighting. These metrics are not defined in terms of the VQ-VAE reconstruction loss, codebook entropy, or any fitted parameter from stage (i). No equation or claim equates a reported performance number to a quantity that was optimized during training. The codebook-freezing question raised by the skeptic is an implementation detail that would affect interpretation but does not make the reported outperformance numbers tautological by construction. The derivation chain therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axioms · 1 invented entities

The framework rests on the standard VQ-VAE reconstruction-plus-commitment objective plus the assumption that a single discrete codebook can serve as a sufficient bottleneck for both visual dynamics and low-level actions across embodiments. No new physical constants or invented particles are introduced.

free parameters (2)

codebook size and commitment loss weight
Typical VQ-VAE hyper-parameters that must be chosen to balance reconstruction fidelity against discretization; their specific values are not reported in the abstract.
number of training stages and data mixture ratios
The three-stage schedule and the relative weighting of cross-embodiment data are design choices fitted to achieve the reported performance.

axioms (1)

domain assumption A discrete latent code learned jointly from vision and motion can serve as an effective intermediate representation that bridges heterogeneous robot embodiments.
Invoked in the description of UVMC as the solution to both precise action generation and domain-gap problems.

invented entities (1)

Unified Vision-Motion Codes (UVMC) no independent evidence
purpose: Shared discrete bottleneck that aligns visual dynamics and robotic motion from mixed data sources.
New named representation introduced by the paper; no external falsifiable prediction (e.g., specific code statistics on new robots) is supplied in the abstract.

pith-pipeline@v0.9.0 · 5927 in / 1529 out tokens · 32966 ms · 2026-05-18T01:00:52.556817+00:00 · methodology

discussion (0)

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Demo-JEPA: Joint-Embedding Predictive Architecture for One-shot Cross-Embodiment Imitation
cs.RO 2026-05 unverdicted novelty 7.0

Demo-JEPA enables one-shot cross-embodiment imitation by mapping visual demonstrations to shared latent future trajectories that serve as subgoals for the target agent's own forward dynamics planning.
From Imagined Futures to Executable Actions: Mixture of Latent Actions for Robot Manipulation
cs.RO 2026-05 unverdicted novelty 7.0

MoLA infers a mixture of latent actions from generated future videos via modality-aware inverse dynamics models to improve robot manipulation policies.
UniT: Toward a Unified Physical Language for Human-to-Humanoid Policy Learning and World Modeling
cs.RO 2026-04 unverdicted novelty 6.0

UniT creates a unified physical language via visual anchoring and tri-branch reconstruction to enable scalable human-to-humanoid transfer for policy learning and world modeling.

Reference graph

Works this paper leans on

119 extracted references · 119 canonical work pages · cited by 3 Pith papers · 17 internal anchors

[1]

write newline

" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION format.date year duplicate empty "emp...

work page
[2]

Affordances from human videos as a versatile representation for robotics

Shikhar Bahl, Russell Mendonca, Lili Chen, Unnat Jain, and Deepak Pathak. Affordances from human videos as a versatile representation for robotics. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.\ 13778--13790, 2023

work page 2023
[3]

Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A frontier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.12966, 1 0 (2): 0 3, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[4]

Katzschmann

Erik Bauer, Elvis Nava, and Robert K. Katzschmann. Latent action diffusion for cross-embodiment manipulation. In Proceedings of the Conference on Robot Learning (CoRL), 2025

work page 2025
[5]

Hydra: Hybrid robot actions for imitation learning

Suneel Belkhale, Yuchen Cui, and Dorsa Sadigh. Hydra: Hybrid robot actions for imitation learning. In Conference on Robot Learning, pp.\ 2113--2133. PMLR, 2023

work page 2023
[6]

PaliGemma: A versatile 3B VLM for transfer

Lucas Beyer, Andreas Steiner, Andr \'e Susano Pinto, Alexander Kolesnikov, Xiao Wang, Daniel Salz, Maxim Neumann, Ibrahim Alabdulmohsin, Michael Tschannen, Emanuele Bugliarello, et al. Paligemma: A versatile 3b vlm for transfer. arXiv preprint arXiv:2407.07726, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[7]

Roboagent: Generalization and efficiency in robot manipulation via semantic augmentations and action chunking

Homanga Bharadhwaj, Jay Vakil, Mohit Sharma, Abhinav Gupta, Shubham Tulsiani, and Vikash Kumar. Roboagent: Generalization and efficiency in robot manipulation via semantic augmentations and action chunking. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pp.\ 4788--4795. IEEE, 2024

work page 2024
[8]

GR00T N1: An Open Foundation Model for Generalist Humanoid Robots

Johan Bjorck, Fernando Casta \ n eda, Nikita Cherniadev, Xingye Da, Runyu Ding, Linxi Fan, Yu Fang, Dieter Fox, Fengyuan Hu, Spencer Huang, et al. Gr00t n1: An open foundation model for generalist humanoid robots. arXiv preprint arXiv:2503.14734, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[9]

$\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al. _0 : A vision-language-action flow model for general robot control. arXiv preprint arXiv:2410.24164, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[10]

RT-1: Robotics Transformer for Real-World Control at Scale

Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, et al. Rt-1: Robotics transformer for real-world control at scale. arXiv preprint arXiv:2212.06817, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[11]

AgiBot World Colosseo: A Large-scale Manipulation Platform for Scalable and Intelligent Embodied Systems

Qingwen Bu, Jisong Cai, Li Chen, Xiuqi Cui, Yan Ding, Siyuan Feng, Shenyuan Gao, Xindong He, Xuan Hu, Xu Huang, et al. Agibot world colosseo: A large-scale manipulation platform for scalable and intelligent embodied systems. arXiv preprint arXiv:2503.06669, 2025 a

work page internal anchor Pith review Pith/arXiv arXiv 2025
[12]

Univla: Learning to act anywhere with task-centric latent actions

Qingwen Bu, Yanting Yang, Jisong Cai, Shenyuan Gao, Guanghui Ren, Maoqing Yao, Ping Luo, and Hongyang Li. Univla: Learning to act anywhere with task-centric latent actions. In Proceedings of Robotics: Science and Systems (RSS), 2025 b

work page 2025
[13]

Mamba policy: Towards efficient 3d diffusion policy with hybrid selective state models

Jiahang Cao, Qiang Zhang, Jingkai Sun, Jiaxu Wang, Hao Cheng, Yulin Li, Jun Ma, Kun Wu, Zhiyuan Xu, Yecheng Shao, et al. Mamba policy: Towards efficient 3d diffusion policy with hybrid selective state models. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2025

work page 2025
[14]

GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation

Chi-Lam Cheang, Guangzeng Chen, Ya Jing, Tao Kong, Hang Li, Yifeng Li, Yuxiao Liu, Hongtao Wu, Jiafeng Xu, Yichu Yang, et al. Gr-2: A generative video-language-action model with web-scale knowledge for robot manipulation. arXiv preprint arXiv:2410.06158, 2025 a

work page internal anchor Pith review Pith/arXiv arXiv 2025
[15]

GR-3 Technical Report

Chilam Cheang, Sijin Chen, Zhongren Cui, Yingdong Hu, Liqun Huang, Tao Kong, Hang Li, Yifeng Li, Yuxiao Liu, Xiao Ma, et al. Gr-3 technical report. arXiv preprint arXiv:2507.15493, 2025 b

work page internal anchor Pith review Pith/arXiv arXiv 2025
[16]

Berkeley UR5 demonstration dataset

Lawrence Yunliang Chen, Simeon Adebola, and Ken Goldberg. Berkeley UR5 demonstration dataset. https://sites.google.com/view/berkeley-ur5/home

work page
[17]

Playfusion: Skill acquisition via diffusion from language-annotated play

Lili Chen, Shikhar Bahl, and Deepak Pathak. Playfusion: Skill acquisition via diffusion from language-annotated play. In Conference on Robot Learning, pp.\ 2012--2029. PMLR, 2023

work page 2012
[18]

Moto: Latent motion token as the bridging language for robot manipulation

Yi Chen, Yuying Ge, Yizhuo Li, Yixiao Ge, Mingyu Ding, Ying Shan, and Xihui Liu. Moto: Latent motion token as the bridging language for robot manipulation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). IEEE, 2025

work page 2025
[19]

Diffusion policy: Visuomotor policy learning via action diffusion

Cheng Chi, Zhenjia Xu, Siyuan Feng, Eric Cousineau, Yilun Du, Benjamin Burchfiel, Russ Tedrake, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion. The International Journal of Robotics Research, pp.\ 02783649241273668, 2023

work page 2023
[20]

From play to policy: Conditional behavior generation from uncurated robot data

Zichen Jeff Cui, Yibin Wang, Nur Muhammad Mahi Shafiullah, and Lerrel Pinto. From play to policy: Conditional behavior generation from uncurated robot data. In Proceedings of the International Conference on Learning Representations (ICLR), 2023

work page 2023
[21]

Shivin Dass, Jullian Yapeter, Jesse Zhang, Jiahui Zhang, Karl Pertsch, Stefanos Nikolaidis, and Joseph J. Lim. Clvr jaco play dataset, 2023. URL https://github.com/clvrai/clvr_jaco_play_dataset

work page 2023
[22]

Danny Driess, Fei Xia, Mehdi S. M. Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, Wenlong Huang, Yevgen Chebotar, Pierre Sermanet, Daniel Duckworth, Sergey Levine, Vincent Vanhoucke, Karol Hausman, Marc Toussaint, Klaus Greff, Andy Zeng, Igor Mordatch, and Pete Florence. P a LM -e: An embodi...

work page 2023
[23]

Learning universal policies via text-guided video generation

Yilun Du, Sherry Yang, Bo Dai, Hanjun Dai, Ofir Nachum, Josh Tenenbaum, Dale Schuurmans, and Pieter Abbeel. Learning universal policies via text-guided video generation. Advances in neural information processing systems, 36: 0 9156--9172, 2023

work page 2023
[24]

Bridge data: Boosting generalization of robotic skills with cross-domain datasets

Frederik Ebert, Yanlai Yang, Karl Schmeckpeper, Bernadette Bucher, Georgios Georgakis, Kostas Daniilidis, Chelsea Finn, and Sergey Levine. Bridge data: Boosting generalization of robotic skills with cross-domain datasets. In Proceedings of Robotics: Science and Systems (RSS), 2022

work page 2022
[25]

Diffusion trajectory-guided policy for long-horizon robot manipulation

Shichao Fan, Quantao Yang, Yajie Liu, Kun Wu, Zhengping Che, Qingjie Liu, and Min Wan. Diffusion trajectory-guided policy for long-horizon robot manipulation. IEEE Robotics and Automation Letters, 10 0 (12): 0 12788--12795, 2025. doi:10.1109/LRA.2025.3619794

work page doi:10.1109/lra.2025.3619794 2025
[26]

Finetuning offline world models in the real world

Yunhai Feng, Nicklas Hansen, Ziyan Xiong, Chandramouli Rajagopalan, and Xiaolong Wang. Finetuning offline world models in the real world. In Conference on Robot Learning (CoRL), 2023

work page 2023
[27]

Mobile aloha: Learning bimanual mobile manipulation with low-cost whole-body teleoperation

Zipeng Fu, Tony Z Zhao, and Chelsea Finn. Mobile aloha: Learning bimanual mobile manipulation with low-cost whole-body teleoperation. In Conference on Robot Learning (CoRL), 2024

work page 2024
[28]

Llama-adapter v2: Parameter-efficient visual instruction model

Peng Gao, Jiaming Han, Renrui Zhang, Ziyi Lin, Shijie Geng, Aojun Zhou, Wei Zhang, Pan Lu, Conghui He, Xiangyu Yue, et al. Llama-adapter v2: Parameter-efficient visual instruction model. In Proceedings of the International Conference on Learning Representations (ICLR), 2024

work page 2024
[29]

Ego4d: Around the world in 3,000 hours of egocentric video

Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, et al. Ego4d: Around the world in 3,000 hours of egocentric video. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp.\ 18995--19012, 2022

work page 2022
[30]

Watch and match: Supercharging imitation with regularized optimal transport

Siddhant Haldar, Vaibhav Mathur, Denis Yarats, and Lerrel Pinto. Watch and match: Supercharging imitation with regularized optimal transport. In Conference on Robot Learning, pp.\ 32--43. PMLR, 2023

work page 2023
[31]

Learning an actionable discrete diffusion policy via large-scale actionless video pre-training

Haoran He, Chenjia Bai, Ling Pan, Weinan Zhang, Bin Zhao, and Xuelong Li. Learning an actionable discrete diffusion policy via large-scale actionless video pre-training. Advances in Neural Information Processing Systems, 37: 0 31124--31153, 2024

work page 2024
[32]

Masked autoencoders are scalable vision learners

Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Doll \'a r, and Ross Girshick. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp.\ 16000--16009, 2022

work page 2022
[33]

Furniturebench: Reproducible real-world benchmark for long-horizon complex manipulation

Minho Heo, Youngwoon Lee, Doohyun Lee, and Joseph J Lim. Furniturebench: Reproducible real-world benchmark for long-horizon complex manipulation. The International Journal of Robotics Research, pp.\ 02783649241304789, 2023

work page 2023
[34]

Sacson: Scalable autonomous control for social navigation

Noriaki Hirose, Dhruv Shah, Ajay Sridhar, and Sergey Levine. Sacson: Scalable autonomous control for social navigation. IEEE Robotics and Automation Letters, 9 0 (1): 0 49--56, 2023

work page 2023
[35]

Video Prediction Policy: A Generalist Robot Policy with Predictive Visual Representations

Yucheng Hu, Yanjiang Guo, Pengchao Wang, Xiaoyu Chen, Yen-Jen Wang, Jianke Zhang, Koushil Sreenath, Chaochao Lu, and Jianyu Chen. Video prediction policy: A generalist robot policy with predictive visual representations. arXiv preprint arXiv:2412.14803, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[36]

$\pi_{0.5}$: a Vision-Language-Action Model with Open-World Generalization

Physical Intelligence, Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, et al. _ 0.5 : a vision-language-action model with open-world generalization. arXiv preprint arXiv:2504.16054, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[37]

Bc-z: Zero-shot task generalization with robotic imitation learning

Eric Jang, Alex Irpan, Mohi Khansari, Daniel Kappler, Frederik Ebert, Corey Lynch, Sergey Levine, and Chelsea Finn. Bc-z: Zero-shot task generalization with robotic imitation learning. In Conference on Robot Learning, pp.\ 991--1002. PMLR, 2022

work page 2022
[38]

Self-supervised deep reinforcement learning with generalized computation graphs for robot navigation

Gregory Kahn, Adam Villaflor, Bosen Ding, Pieter Abbeel, and Sergey Levine. Self-supervised deep reinforcement learning with generalized computation graphs for robot navigation. In 2018 IEEE international conference on robotics and automation (ICRA), pp.\ 5129--5136. IEEE, 2018

work page 2018
[39]

Scalable deep reinforcement learning for vision-based robotic manipulation

Dmitry Kalashnikov, Alex Irpan, Peter Pastor, Julian Ibarz, Alexander Herzog, Eric Jang, Deirdre Quillen, Ethan Holly, Mrinal Kalakrishnan, Vincent Vanhoucke, et al. Scalable deep reinforcement learning for vision-based robotic manipulation. In Conference on robot learning, pp.\ 651--673. PMLR, 2018

work page 2018
[40]

DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset

Alexander Khazatsky, Karl Pertsch, Suraj Nair, Ashwin Balakrishna, Sudeep Dasari, Siddharth Karamcheti, Soroush Nasiriany, Mohan Kumar Srirama, Lawrence Yunliang Chen, Kirsty Ellis, et al. Droid: A large-scale in-the-wild robot manipulation dataset. arXiv preprint arXiv:2403.12945, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[41]

Pre-and post-contact policy decomposition for non-prehensile manipulation with zero-shot sim-to-real transfer

Minchan Kim, Junhyek Han, Jaehyung Kim, and Beomjoon Kim. Pre-and post-contact policy decomposition for non-prehensile manipulation with zero-shot sim-to-real transfer. In 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp.\ 10644--10651. IEEE, 2023

work page 2023
[42]

Openvla: An open-source vision-language-action model

Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. Openvla: An open-source vision-language-action model. In Conference on Robot Learning (CoRL), 2024

work page 2024
[43]

Molmoact: Action reasoning models that can reason in space

Jason Lee, Jiafei Duan, Haoquan Fang, Yuquan Deng, Shuo Liu, Boyang Li, Bohan Fang, Jieyu Zhang, Yi Ru Wang, Sangho Lee, et al. Molmoact: Action reasoning models that can reason in space. In CoRL 2025 Robot Data Workshop, 2025

work page 2025
[44]

Behavior generation with latent actions

Seungjae Lee, Yibin Wang, Haritheja Etukuru, H Jin Kim, Nur Muhammad Mahi Shafiullah, and Lerrel Pinto. Behavior generation with latent actions. In Proceedings of the International Conference on Machine Learning (ICML), 2024

work page 2024
[45]

Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models

Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International conference on machine learning, pp.\ 19730--19742. PMLR, 2023

work page 2023
[46]

Switchvla: Execution-aware task switching for vision-language-action models.arXiv preprint arXiv:2506.03574, 2025

Meng Li, Zhen Zhao, Zhengping Che, Fei Liao, Kun Wu, Zhiyuan Xu, Pei Ren, Zhao Jin, Ning Liu, and Jian Tang. Switchvla: Execution-aware task switching for vision-language-action models. arXiv preprint arXiv:2506.03574, 2025 a

work page arXiv 2025
[47]

Unified video action model

Shuang Li, Yihuai Gao, Dorsa Sadigh, and Shuran Song. Unified video action model. In Proceedings of Robotics: Science and Systems (RSS), 2025 b

work page 2025
[48]

Visual instruction tuning

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in neural information processing systems, 36: 0 34892--34916, 2023

work page 2023
[49]

Robot learning on the job: Human-in-the-loop autonomy and learning during deployment

Huihan Liu, Soroush Nasiriany, Lance Zhang, Zhiyao Bao, and Yuke Zhu. Robot learning on the job: Human-in-the-loop autonomy and learning during deployment. The International Journal of Robotics Research, pp.\ 02783649241273901, 2022

work page 2022
[50]

HybridVLA: Collaborative Diffusion and Autoregression in a Unified Vision-Language-Action Model

Jiaming Liu, Hao Chen, Pengju An, Zhuoyang Liu, Renrui Zhang, Chenyang Gu, Xiaoqi Li, Ziyu Guo, Sixiang Chen, Mengzhen Liu, et al. Hybridvla: Collaborative diffusion and autoregression in a unified vision-language-action model. arXiv preprint arXiv:2503.10631, 2025 a

work page internal anchor Pith review Pith/arXiv arXiv 2025
[51]

Rdt-1b: a diffusion foundation model for bimanual manipulation

Songming Liu, Lingxuan Wu, Bangguo Li, Hengkai Tan, Huayu Chen, Zhengyi Wang, Ke Xu, Hang Su, and Jun Zhu. Rdt-1b: a diffusion foundation model for bimanual manipulation. In Proceedings of the International Conference on Learning Representations (ICLR), 2025 b

work page 2025
[52]

Mla: A multisen- sory language-action model for multimodal understanding and forecasting in robotic manipulation.arXiv preprint arXiv:2509.26642, 2025

Zhuoyang Liu, Jiaming Liu, Jiadong Xu, Nuowei Han, Chenyang Gu, Hao Chen, Kaichen Zhou, Renrui Zhang, Kai Chin Hsieh, Kun Wu, et al. Mla: A multisensory language-action model for multimodal understanding and forecasting in robotic manipulation. arXiv preprint arXiv:2509.26642, 2025 c

work page arXiv 2025
[53]

Multistage cable routing through hierarchical imitation learning

Jianlan Luo, Charles Xu, Xinyang Geng, Gilbert Feng, Kuan Fang, Liam Tan, Stefan Schaal, and Sergey Levine. Multistage cable routing through hierarchical imitation learning. IEEE Transactions on Robotics, 40: 0 1476--1491, 2024

work page 2024
[54]

Fmb: a functional manipulation benchmark for generalizable robotic learning

Jianlan Luo, Charles Xu, Fangchen Liu, Liam Tan, Zipeng Lin, Jeffrey Wu, Pieter Abbeel, and Sergey Levine. Fmb: a functional manipulation benchmark for generalizable robotic learning. The International Journal of Robotics Research, 44 0 (4): 0 592--606, 2025

work page 2025
[55]

Interactive language: Talking to robots in real time

Corey Lynch, Ayzaan Wahid, Jonathan Tompson, Tianli Ding, James Betker, Robert Baruch, Travis Armstrong, and Pete Florence. Interactive language: Talking to robots in real time. IEEE Robotics and Automation Letters, 2023

work page 2023
[56]

Scaling robot supervision to hundreds of hours with roboturk: Robotic manipulation dataset through human reasoning and dexterity

Ajay Mandlekar, Jonathan Booher, Max Spero, Albert Tung, Anchit Gupta, Yuke Zhu, Animesh Garg, Silvio Savarese, and Li Fei-Fei. Scaling robot supervision to hundreds of hours with roboturk: Robotic manipulation dataset through human reasoning and dexterity. In 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp.\ 1048--1055...

work page 2019
[57]

Weblab xarm dataset, 2023

Tatsuya Matsushima, Hiroki Furuta, Yusuke Iwasawa, and Yutaka Matsuo. Weblab xarm dataset, 2023

work page 2023
[58]

Grounding language with visual affordances over unstructured data

Oier Mees, Jessica Borja-Diaz, and Wolfram Burgard. Grounding language with visual affordances over unstructured data. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), 2023

work page 2023
[59]

Structured world models from human videos

Russell Mendonca, Shikhar Bahl, and Deepak Pathak. Structured world models from human videos. In Proceedings of Robotics: Science and Systems (RSS), 2023

work page 2023
[60]

Quest: Self-supervised skill abstractions for learning continuous control

Atharva Mete, Haotian Xue, Albert Wilcox, Yongxin Chen, and Animesh Garg. Quest: Self-supervised skill abstractions for learning continuous control. Advances in Neural Information Processing Systems, 37: 0 4062--4089, 2024

work page 2024
[61]

Learning and retrieval from prior data for skill-based imitation learning

Soroush Nasiriany, Tian Gao, Ajay Mandlekar, and Yuke Zhu. Learning and retrieval from prior data for skill-based imitation learning. In Proceedings of the Conference on Robot Learning (CoRL), 2022

work page 2022
[62]

X-embodiment u-tokyo pr2 datasets

Jihoon Oh, Naoaki Kanazawa, and Kento Kawaharazuka. X-embodiment u-tokyo pr2 datasets. URL https://github. com/ojh6404/rlds\_dataset\_builder, 22, 2023

work page 2023
[63]

Motion planning by learning the solution manifold in trajectory optimization

Takayuki Osa. Motion planning by learning the solution manifold in trajectory optimization. The International Journal of Robotics Research, 41 0 (3): 0 281--311, 2022

work page 2022
[64]

Open x-embodiment: Robotic learning datasets and rt-x models: Open x-embodiment collaboration 0

Abby O’Neill, Abdul Rehman, Abhiram Maddukuri, Abhishek Gupta, Abhishek Padalkar, Abraham Lee, Acorn Pooley, Agrim Gupta, Ajay Mandlekar, Ajinkya Jain, et al. Open x-embodiment: Robotic learning datasets and rt-x models: Open x-embodiment collaboration 0. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pp.\ 6892--6903. IEEE, 2024

work page 2024
[65]

A guided reinforcement learning approach using shared control templates for learning manipulation skills in the real world

Abhishek Padalkar, Gabriel Quere, Antonin Raffin, Jo \ a o Silv \'e rio, and Freek Stulp. A guided reinforcement learning approach using shared control templates for learning manipulation skills in the real world. 2023 a

work page 2023
[66]

Guiding reinforcement learning with shared control templates

Abhishek Padalkar, Gabriel Quere, Franz Steinmetz, Antonin Raffin, Matthias Nieuwenhuisen, Jo \ a o Silv \'e rio, and Freek Stulp. Guiding reinforcement learning with shared control templates. In ICRA, pp.\ 11531--11537, 2023 b

work page 2023
[67]

The surprising effectiveness of representation learning for visual imitation

Jyothish Pari, Nur Muhammad Shafiullah, Sridhar Pandian Arunachalam, and Lerrel Pinto. The surprising effectiveness of representation learning for visual imitation. In Proceedings of Robotics: Science and Systems (RSS), 2022

work page 2022
[68]

Supramodal and cross-modal representations of working memory in higher-order cortex

Doyoung Park, Seong-Hwan Hwang, Keonwoo Lee, Yeeun Ryoo, Hyoung F Kim, and Sue-Hyun Lee. Supramodal and cross-modal representations of working memory in higher-order cortex. Nature Communications, 16 0 (1): 0 4497, 2025

work page 2025
[69]

Embodied artificial intelligence: Trends and challenges

Rolf Pfeifer and Fumiya Iida. Embodied artificial intelligence: Trends and challenges. Lecture notes in computer science, pp.\ 1--26, 2004

work page 2004
[70]

Spatialvla: Exploring spatial representations for visual-language-action model

Delin Qu, Haoming Song, Qizhi Chen, Yuanqi Yao, Xinyi Ye, Yan Ding, Zhigang Wang, JiaYuan Gu, Bin Zhao, Dong Wang, et al. Spatialvla: Exploring spatial representations for visual-language-action model. In Proceedings of Robotics: Science and Systems (RSS), 2025

work page 2025
[71]

Shared control templates for assistive robotics

Gabriel Quere, Annette Hagengruber, Maged Iskandar, Samuel Bustamante, Daniel Leidner, Freek Stulp, and J \"o rn Vogel. Shared control templates for assistive robotics. In 2020 IEEE international conference on robotics and automation (ICRA), pp.\ 1956--1962. IEEE, 2020

work page 2020
[72]

Robot learning with sensorimotor pre-training

Ilija Radosavovic, Baifeng Shi, Letian Fu, Ken Goldberg, Trevor Darrell, and Jitendra Malik. Robot learning with sensorimotor pre-training. In Conference on Robot Learning, pp.\ 683--693. PMLR, 2023 a

work page 2023
[73]

Real-world robot learning with masked visual pre-training

Ilija Radosavovic, Tete Xiao, Stephen James, Pieter Abbeel, Jitendra Malik, and Trevor Darrell. Real-world robot learning with masked visual pre-training. In Conference on Robot Learning, pp.\ 416--426. PMLR, 2023 b

work page 2023
[74]

Latent plans for task-agnostic offline reinforcement learning

Erick Rosete-Beas, Oier Mees, Gabriel Kalweit, Joschka Boedecker, and Wolfram Burgard. Latent plans for task-agnostic offline reinforcement learning. In Conference on Robot Learning, pp.\ 1838--1849. PMLR, 2023

work page 2023
[75]

Multi-resolution sensing for real-time control with vision-language models

Saumya Saxena, Mohit Sharma, and Oliver Kroemer. Multi-resolution sensing for real-time control with vision-language models. In 2nd Workshop on Language and Robot Learning: Language as Grounding, 2023

work page 2023
[76]

Behavior transformers: Cloning k modes with one stone

Nur Muhammad Shafiullah, Zichen Cui, Ariuntuya Arty Altanzaya, and Lerrel Pinto. Behavior transformers: Cloning k modes with one stone. Advances in neural information processing systems, 35: 0 22955--22968, 2022

work page 2022
[77]

On bringing robots home

Nur Muhammad Mahi Shafiullah, Anant Rai, Haritheja Etukuru, Yiqian Liu, Ishan Misra, Soumith Chintala, and Lerrel Pinto. On bringing robots home. arXiv preprint arXiv:2311.16098, 2023

work page arXiv 2023
[78]

Rapid exploration for open-world navigation with latent goal models

Dhruv Shah, Benjamin Eysenbach, Gregory Kahn, Nicholas Rhinehart, and Sergey Levine. Rapid exploration for open-world navigation with latent goal models. In Proceedings of the Conference on Robot Learning (CoRL), 2021

work page 2021
[79]

Mutex: Learning unified policies from multimodal task specifications

Rutav Shah, Roberto Mart \' n-Mart \' n, and Yuke Zhu. Mutex: Learning unified policies from multimodal task specifications. In Proceedings of the Conference on Robot Learning (CoRL), 2023

work page 2023
[80]

Dense policy: Bidirectional autoregressive learning of actions

Yue Su, Xinyu Zhan, Hongjie Fang, Han Xue, Hao-Shu Fang, Yong-Lu Li, Cewu Lu, and Lixin Yang. Dense policy: Bidirectional autoregressive learning of actions. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025

work page 2025

Showing first 80 references.

[1] [1]

write newline

" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION format.date year duplicate empty "emp...

work page

[2] [2]

Affordances from human videos as a versatile representation for robotics

Shikhar Bahl, Russell Mendonca, Lili Chen, Unnat Jain, and Deepak Pathak. Affordances from human videos as a versatile representation for robotics. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.\ 13778--13790, 2023

work page 2023

[3] [3]

Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A frontier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.12966, 1 0 (2): 0 3, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[4] [4]

Katzschmann

Erik Bauer, Elvis Nava, and Robert K. Katzschmann. Latent action diffusion for cross-embodiment manipulation. In Proceedings of the Conference on Robot Learning (CoRL), 2025

work page 2025

[5] [5]

Hydra: Hybrid robot actions for imitation learning

Suneel Belkhale, Yuchen Cui, and Dorsa Sadigh. Hydra: Hybrid robot actions for imitation learning. In Conference on Robot Learning, pp.\ 2113--2133. PMLR, 2023

work page 2023

[6] [6]

PaliGemma: A versatile 3B VLM for transfer

Lucas Beyer, Andreas Steiner, Andr \'e Susano Pinto, Alexander Kolesnikov, Xiao Wang, Daniel Salz, Maxim Neumann, Ibrahim Alabdulmohsin, Michael Tschannen, Emanuele Bugliarello, et al. Paligemma: A versatile 3b vlm for transfer. arXiv preprint arXiv:2407.07726, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[7] [7]

Roboagent: Generalization and efficiency in robot manipulation via semantic augmentations and action chunking

Homanga Bharadhwaj, Jay Vakil, Mohit Sharma, Abhinav Gupta, Shubham Tulsiani, and Vikash Kumar. Roboagent: Generalization and efficiency in robot manipulation via semantic augmentations and action chunking. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pp.\ 4788--4795. IEEE, 2024

work page 2024

[8] [8]

GR00T N1: An Open Foundation Model for Generalist Humanoid Robots

Johan Bjorck, Fernando Casta \ n eda, Nikita Cherniadev, Xingye Da, Runyu Ding, Linxi Fan, Yu Fang, Dieter Fox, Fengyuan Hu, Spencer Huang, et al. Gr00t n1: An open foundation model for generalist humanoid robots. arXiv preprint arXiv:2503.14734, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[9] [9]

$\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al. _0 : A vision-language-action flow model for general robot control. arXiv preprint arXiv:2410.24164, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[10] [10]

RT-1: Robotics Transformer for Real-World Control at Scale

Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, et al. Rt-1: Robotics transformer for real-world control at scale. arXiv preprint arXiv:2212.06817, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022

[11] [11]

AgiBot World Colosseo: A Large-scale Manipulation Platform for Scalable and Intelligent Embodied Systems

Qingwen Bu, Jisong Cai, Li Chen, Xiuqi Cui, Yan Ding, Siyuan Feng, Shenyuan Gao, Xindong He, Xuan Hu, Xu Huang, et al. Agibot world colosseo: A large-scale manipulation platform for scalable and intelligent embodied systems. arXiv preprint arXiv:2503.06669, 2025 a

work page internal anchor Pith review Pith/arXiv arXiv 2025

[12] [12]

Univla: Learning to act anywhere with task-centric latent actions

Qingwen Bu, Yanting Yang, Jisong Cai, Shenyuan Gao, Guanghui Ren, Maoqing Yao, Ping Luo, and Hongyang Li. Univla: Learning to act anywhere with task-centric latent actions. In Proceedings of Robotics: Science and Systems (RSS), 2025 b

work page 2025

[13] [13]

Mamba policy: Towards efficient 3d diffusion policy with hybrid selective state models

Jiahang Cao, Qiang Zhang, Jingkai Sun, Jiaxu Wang, Hao Cheng, Yulin Li, Jun Ma, Kun Wu, Zhiyuan Xu, Yecheng Shao, et al. Mamba policy: Towards efficient 3d diffusion policy with hybrid selective state models. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2025

work page 2025

[14] [14]

GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation

Chi-Lam Cheang, Guangzeng Chen, Ya Jing, Tao Kong, Hang Li, Yifeng Li, Yuxiao Liu, Hongtao Wu, Jiafeng Xu, Yichu Yang, et al. Gr-2: A generative video-language-action model with web-scale knowledge for robot manipulation. arXiv preprint arXiv:2410.06158, 2025 a

work page internal anchor Pith review Pith/arXiv arXiv 2025

[15] [15]

GR-3 Technical Report

Chilam Cheang, Sijin Chen, Zhongren Cui, Yingdong Hu, Liqun Huang, Tao Kong, Hang Li, Yifeng Li, Yuxiao Liu, Xiao Ma, et al. Gr-3 technical report. arXiv preprint arXiv:2507.15493, 2025 b

work page internal anchor Pith review Pith/arXiv arXiv 2025

[16] [16]

Berkeley UR5 demonstration dataset

Lawrence Yunliang Chen, Simeon Adebola, and Ken Goldberg. Berkeley UR5 demonstration dataset. https://sites.google.com/view/berkeley-ur5/home

work page

[17] [17]

Playfusion: Skill acquisition via diffusion from language-annotated play

Lili Chen, Shikhar Bahl, and Deepak Pathak. Playfusion: Skill acquisition via diffusion from language-annotated play. In Conference on Robot Learning, pp.\ 2012--2029. PMLR, 2023

work page 2012

[18] [18]

Moto: Latent motion token as the bridging language for robot manipulation

Yi Chen, Yuying Ge, Yizhuo Li, Yixiao Ge, Mingyu Ding, Ying Shan, and Xihui Liu. Moto: Latent motion token as the bridging language for robot manipulation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). IEEE, 2025

work page 2025

[19] [19]

Diffusion policy: Visuomotor policy learning via action diffusion

Cheng Chi, Zhenjia Xu, Siyuan Feng, Eric Cousineau, Yilun Du, Benjamin Burchfiel, Russ Tedrake, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion. The International Journal of Robotics Research, pp.\ 02783649241273668, 2023

work page 2023

[20] [20]

From play to policy: Conditional behavior generation from uncurated robot data

Zichen Jeff Cui, Yibin Wang, Nur Muhammad Mahi Shafiullah, and Lerrel Pinto. From play to policy: Conditional behavior generation from uncurated robot data. In Proceedings of the International Conference on Learning Representations (ICLR), 2023

work page 2023

[21] [21]

Shivin Dass, Jullian Yapeter, Jesse Zhang, Jiahui Zhang, Karl Pertsch, Stefanos Nikolaidis, and Joseph J. Lim. Clvr jaco play dataset, 2023. URL https://github.com/clvrai/clvr_jaco_play_dataset

work page 2023

[22] [22]

Danny Driess, Fei Xia, Mehdi S. M. Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, Wenlong Huang, Yevgen Chebotar, Pierre Sermanet, Daniel Duckworth, Sergey Levine, Vincent Vanhoucke, Karol Hausman, Marc Toussaint, Klaus Greff, Andy Zeng, Igor Mordatch, and Pete Florence. P a LM -e: An embodi...

work page 2023

[23] [23]

Learning universal policies via text-guided video generation

Yilun Du, Sherry Yang, Bo Dai, Hanjun Dai, Ofir Nachum, Josh Tenenbaum, Dale Schuurmans, and Pieter Abbeel. Learning universal policies via text-guided video generation. Advances in neural information processing systems, 36: 0 9156--9172, 2023

work page 2023

[24] [24]

Bridge data: Boosting generalization of robotic skills with cross-domain datasets

Frederik Ebert, Yanlai Yang, Karl Schmeckpeper, Bernadette Bucher, Georgios Georgakis, Kostas Daniilidis, Chelsea Finn, and Sergey Levine. Bridge data: Boosting generalization of robotic skills with cross-domain datasets. In Proceedings of Robotics: Science and Systems (RSS), 2022

work page 2022

[25] [25]

Diffusion trajectory-guided policy for long-horizon robot manipulation

Shichao Fan, Quantao Yang, Yajie Liu, Kun Wu, Zhengping Che, Qingjie Liu, and Min Wan. Diffusion trajectory-guided policy for long-horizon robot manipulation. IEEE Robotics and Automation Letters, 10 0 (12): 0 12788--12795, 2025. doi:10.1109/LRA.2025.3619794

work page doi:10.1109/lra.2025.3619794 2025

[26] [26]

Finetuning offline world models in the real world

Yunhai Feng, Nicklas Hansen, Ziyan Xiong, Chandramouli Rajagopalan, and Xiaolong Wang. Finetuning offline world models in the real world. In Conference on Robot Learning (CoRL), 2023

work page 2023

[27] [27]

Mobile aloha: Learning bimanual mobile manipulation with low-cost whole-body teleoperation

Zipeng Fu, Tony Z Zhao, and Chelsea Finn. Mobile aloha: Learning bimanual mobile manipulation with low-cost whole-body teleoperation. In Conference on Robot Learning (CoRL), 2024

work page 2024

[28] [28]

Llama-adapter v2: Parameter-efficient visual instruction model

Peng Gao, Jiaming Han, Renrui Zhang, Ziyi Lin, Shijie Geng, Aojun Zhou, Wei Zhang, Pan Lu, Conghui He, Xiangyu Yue, et al. Llama-adapter v2: Parameter-efficient visual instruction model. In Proceedings of the International Conference on Learning Representations (ICLR), 2024

work page 2024

[29] [29]

Ego4d: Around the world in 3,000 hours of egocentric video

Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, et al. Ego4d: Around the world in 3,000 hours of egocentric video. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp.\ 18995--19012, 2022

work page 2022

[30] [30]

Watch and match: Supercharging imitation with regularized optimal transport

Siddhant Haldar, Vaibhav Mathur, Denis Yarats, and Lerrel Pinto. Watch and match: Supercharging imitation with regularized optimal transport. In Conference on Robot Learning, pp.\ 32--43. PMLR, 2023

work page 2023

[31] [31]

Learning an actionable discrete diffusion policy via large-scale actionless video pre-training

Haoran He, Chenjia Bai, Ling Pan, Weinan Zhang, Bin Zhao, and Xuelong Li. Learning an actionable discrete diffusion policy via large-scale actionless video pre-training. Advances in Neural Information Processing Systems, 37: 0 31124--31153, 2024

work page 2024

[32] [32]

Masked autoencoders are scalable vision learners

Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Doll \'a r, and Ross Girshick. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp.\ 16000--16009, 2022

work page 2022

[33] [33]

Furniturebench: Reproducible real-world benchmark for long-horizon complex manipulation

Minho Heo, Youngwoon Lee, Doohyun Lee, and Joseph J Lim. Furniturebench: Reproducible real-world benchmark for long-horizon complex manipulation. The International Journal of Robotics Research, pp.\ 02783649241304789, 2023

work page 2023

[34] [34]

Sacson: Scalable autonomous control for social navigation

Noriaki Hirose, Dhruv Shah, Ajay Sridhar, and Sergey Levine. Sacson: Scalable autonomous control for social navigation. IEEE Robotics and Automation Letters, 9 0 (1): 0 49--56, 2023

work page 2023

[35] [35]

Video Prediction Policy: A Generalist Robot Policy with Predictive Visual Representations

Yucheng Hu, Yanjiang Guo, Pengchao Wang, Xiaoyu Chen, Yen-Jen Wang, Jianke Zhang, Koushil Sreenath, Chaochao Lu, and Jianyu Chen. Video prediction policy: A generalist robot policy with predictive visual representations. arXiv preprint arXiv:2412.14803, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[36] [36]

$\pi_{0.5}$: a Vision-Language-Action Model with Open-World Generalization

Physical Intelligence, Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, et al. _ 0.5 : a vision-language-action model with open-world generalization. arXiv preprint arXiv:2504.16054, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[37] [37]

Bc-z: Zero-shot task generalization with robotic imitation learning

Eric Jang, Alex Irpan, Mohi Khansari, Daniel Kappler, Frederik Ebert, Corey Lynch, Sergey Levine, and Chelsea Finn. Bc-z: Zero-shot task generalization with robotic imitation learning. In Conference on Robot Learning, pp.\ 991--1002. PMLR, 2022

work page 2022

[38] [38]

Self-supervised deep reinforcement learning with generalized computation graphs for robot navigation

Gregory Kahn, Adam Villaflor, Bosen Ding, Pieter Abbeel, and Sergey Levine. Self-supervised deep reinforcement learning with generalized computation graphs for robot navigation. In 2018 IEEE international conference on robotics and automation (ICRA), pp.\ 5129--5136. IEEE, 2018

work page 2018

[39] [39]

Scalable deep reinforcement learning for vision-based robotic manipulation

Dmitry Kalashnikov, Alex Irpan, Peter Pastor, Julian Ibarz, Alexander Herzog, Eric Jang, Deirdre Quillen, Ethan Holly, Mrinal Kalakrishnan, Vincent Vanhoucke, et al. Scalable deep reinforcement learning for vision-based robotic manipulation. In Conference on robot learning, pp.\ 651--673. PMLR, 2018

work page 2018

[40] [40]

DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset

Alexander Khazatsky, Karl Pertsch, Suraj Nair, Ashwin Balakrishna, Sudeep Dasari, Siddharth Karamcheti, Soroush Nasiriany, Mohan Kumar Srirama, Lawrence Yunliang Chen, Kirsty Ellis, et al. Droid: A large-scale in-the-wild robot manipulation dataset. arXiv preprint arXiv:2403.12945, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[41] [41]

Pre-and post-contact policy decomposition for non-prehensile manipulation with zero-shot sim-to-real transfer

Minchan Kim, Junhyek Han, Jaehyung Kim, and Beomjoon Kim. Pre-and post-contact policy decomposition for non-prehensile manipulation with zero-shot sim-to-real transfer. In 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp.\ 10644--10651. IEEE, 2023

work page 2023

[42] [42]

Openvla: An open-source vision-language-action model

Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. Openvla: An open-source vision-language-action model. In Conference on Robot Learning (CoRL), 2024

work page 2024

[43] [43]

Molmoact: Action reasoning models that can reason in space

Jason Lee, Jiafei Duan, Haoquan Fang, Yuquan Deng, Shuo Liu, Boyang Li, Bohan Fang, Jieyu Zhang, Yi Ru Wang, Sangho Lee, et al. Molmoact: Action reasoning models that can reason in space. In CoRL 2025 Robot Data Workshop, 2025

work page 2025

[44] [44]

Behavior generation with latent actions

Seungjae Lee, Yibin Wang, Haritheja Etukuru, H Jin Kim, Nur Muhammad Mahi Shafiullah, and Lerrel Pinto. Behavior generation with latent actions. In Proceedings of the International Conference on Machine Learning (ICML), 2024

work page 2024

[45] [45]

Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models

Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International conference on machine learning, pp.\ 19730--19742. PMLR, 2023

work page 2023

[46] [46]

Switchvla: Execution-aware task switching for vision-language-action models.arXiv preprint arXiv:2506.03574, 2025

Meng Li, Zhen Zhao, Zhengping Che, Fei Liao, Kun Wu, Zhiyuan Xu, Pei Ren, Zhao Jin, Ning Liu, and Jian Tang. Switchvla: Execution-aware task switching for vision-language-action models. arXiv preprint arXiv:2506.03574, 2025 a

work page arXiv 2025

[47] [47]

Unified video action model

Shuang Li, Yihuai Gao, Dorsa Sadigh, and Shuran Song. Unified video action model. In Proceedings of Robotics: Science and Systems (RSS), 2025 b

work page 2025

[48] [48]

Visual instruction tuning

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in neural information processing systems, 36: 0 34892--34916, 2023

work page 2023

[49] [49]

Robot learning on the job: Human-in-the-loop autonomy and learning during deployment

Huihan Liu, Soroush Nasiriany, Lance Zhang, Zhiyao Bao, and Yuke Zhu. Robot learning on the job: Human-in-the-loop autonomy and learning during deployment. The International Journal of Robotics Research, pp.\ 02783649241273901, 2022

work page 2022

[50] [50]

HybridVLA: Collaborative Diffusion and Autoregression in a Unified Vision-Language-Action Model

Jiaming Liu, Hao Chen, Pengju An, Zhuoyang Liu, Renrui Zhang, Chenyang Gu, Xiaoqi Li, Ziyu Guo, Sixiang Chen, Mengzhen Liu, et al. Hybridvla: Collaborative diffusion and autoregression in a unified vision-language-action model. arXiv preprint arXiv:2503.10631, 2025 a

work page internal anchor Pith review Pith/arXiv arXiv 2025

[51] [51]

Rdt-1b: a diffusion foundation model for bimanual manipulation

Songming Liu, Lingxuan Wu, Bangguo Li, Hengkai Tan, Huayu Chen, Zhengyi Wang, Ke Xu, Hang Su, and Jun Zhu. Rdt-1b: a diffusion foundation model for bimanual manipulation. In Proceedings of the International Conference on Learning Representations (ICLR), 2025 b

work page 2025

[52] [52]

Mla: A multisen- sory language-action model for multimodal understanding and forecasting in robotic manipulation.arXiv preprint arXiv:2509.26642, 2025

Zhuoyang Liu, Jiaming Liu, Jiadong Xu, Nuowei Han, Chenyang Gu, Hao Chen, Kaichen Zhou, Renrui Zhang, Kai Chin Hsieh, Kun Wu, et al. Mla: A multisensory language-action model for multimodal understanding and forecasting in robotic manipulation. arXiv preprint arXiv:2509.26642, 2025 c

work page arXiv 2025

[53] [53]

Multistage cable routing through hierarchical imitation learning

Jianlan Luo, Charles Xu, Xinyang Geng, Gilbert Feng, Kuan Fang, Liam Tan, Stefan Schaal, and Sergey Levine. Multistage cable routing through hierarchical imitation learning. IEEE Transactions on Robotics, 40: 0 1476--1491, 2024

work page 2024

[54] [54]

Fmb: a functional manipulation benchmark for generalizable robotic learning

Jianlan Luo, Charles Xu, Fangchen Liu, Liam Tan, Zipeng Lin, Jeffrey Wu, Pieter Abbeel, and Sergey Levine. Fmb: a functional manipulation benchmark for generalizable robotic learning. The International Journal of Robotics Research, 44 0 (4): 0 592--606, 2025

work page 2025

[55] [55]

Interactive language: Talking to robots in real time

Corey Lynch, Ayzaan Wahid, Jonathan Tompson, Tianli Ding, James Betker, Robert Baruch, Travis Armstrong, and Pete Florence. Interactive language: Talking to robots in real time. IEEE Robotics and Automation Letters, 2023

work page 2023

[56] [56]

Scaling robot supervision to hundreds of hours with roboturk: Robotic manipulation dataset through human reasoning and dexterity

Ajay Mandlekar, Jonathan Booher, Max Spero, Albert Tung, Anchit Gupta, Yuke Zhu, Animesh Garg, Silvio Savarese, and Li Fei-Fei. Scaling robot supervision to hundreds of hours with roboturk: Robotic manipulation dataset through human reasoning and dexterity. In 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp.\ 1048--1055...

work page 2019

[57] [57]

Weblab xarm dataset, 2023

Tatsuya Matsushima, Hiroki Furuta, Yusuke Iwasawa, and Yutaka Matsuo. Weblab xarm dataset, 2023

work page 2023

[58] [58]

Grounding language with visual affordances over unstructured data

Oier Mees, Jessica Borja-Diaz, and Wolfram Burgard. Grounding language with visual affordances over unstructured data. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), 2023

work page 2023

[59] [59]

Structured world models from human videos

Russell Mendonca, Shikhar Bahl, and Deepak Pathak. Structured world models from human videos. In Proceedings of Robotics: Science and Systems (RSS), 2023

work page 2023

[60] [60]

Quest: Self-supervised skill abstractions for learning continuous control

Atharva Mete, Haotian Xue, Albert Wilcox, Yongxin Chen, and Animesh Garg. Quest: Self-supervised skill abstractions for learning continuous control. Advances in Neural Information Processing Systems, 37: 0 4062--4089, 2024

work page 2024

[61] [61]

Learning and retrieval from prior data for skill-based imitation learning

Soroush Nasiriany, Tian Gao, Ajay Mandlekar, and Yuke Zhu. Learning and retrieval from prior data for skill-based imitation learning. In Proceedings of the Conference on Robot Learning (CoRL), 2022

work page 2022

[62] [62]

X-embodiment u-tokyo pr2 datasets

Jihoon Oh, Naoaki Kanazawa, and Kento Kawaharazuka. X-embodiment u-tokyo pr2 datasets. URL https://github. com/ojh6404/rlds\_dataset\_builder, 22, 2023

work page 2023

[63] [63]

Motion planning by learning the solution manifold in trajectory optimization

Takayuki Osa. Motion planning by learning the solution manifold in trajectory optimization. The International Journal of Robotics Research, 41 0 (3): 0 281--311, 2022

work page 2022

[64] [64]

Open x-embodiment: Robotic learning datasets and rt-x models: Open x-embodiment collaboration 0

Abby O’Neill, Abdul Rehman, Abhiram Maddukuri, Abhishek Gupta, Abhishek Padalkar, Abraham Lee, Acorn Pooley, Agrim Gupta, Ajay Mandlekar, Ajinkya Jain, et al. Open x-embodiment: Robotic learning datasets and rt-x models: Open x-embodiment collaboration 0. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pp.\ 6892--6903. IEEE, 2024

work page 2024

[65] [65]

A guided reinforcement learning approach using shared control templates for learning manipulation skills in the real world

Abhishek Padalkar, Gabriel Quere, Antonin Raffin, Jo \ a o Silv \'e rio, and Freek Stulp. A guided reinforcement learning approach using shared control templates for learning manipulation skills in the real world. 2023 a

work page 2023

[66] [66]

Guiding reinforcement learning with shared control templates

Abhishek Padalkar, Gabriel Quere, Franz Steinmetz, Antonin Raffin, Matthias Nieuwenhuisen, Jo \ a o Silv \'e rio, and Freek Stulp. Guiding reinforcement learning with shared control templates. In ICRA, pp.\ 11531--11537, 2023 b

work page 2023

[67] [67]

The surprising effectiveness of representation learning for visual imitation

Jyothish Pari, Nur Muhammad Shafiullah, Sridhar Pandian Arunachalam, and Lerrel Pinto. The surprising effectiveness of representation learning for visual imitation. In Proceedings of Robotics: Science and Systems (RSS), 2022

work page 2022

[68] [68]

Supramodal and cross-modal representations of working memory in higher-order cortex

Doyoung Park, Seong-Hwan Hwang, Keonwoo Lee, Yeeun Ryoo, Hyoung F Kim, and Sue-Hyun Lee. Supramodal and cross-modal representations of working memory in higher-order cortex. Nature Communications, 16 0 (1): 0 4497, 2025

work page 2025

[69] [69]

Embodied artificial intelligence: Trends and challenges

Rolf Pfeifer and Fumiya Iida. Embodied artificial intelligence: Trends and challenges. Lecture notes in computer science, pp.\ 1--26, 2004

work page 2004

[70] [70]

Spatialvla: Exploring spatial representations for visual-language-action model

Delin Qu, Haoming Song, Qizhi Chen, Yuanqi Yao, Xinyi Ye, Yan Ding, Zhigang Wang, JiaYuan Gu, Bin Zhao, Dong Wang, et al. Spatialvla: Exploring spatial representations for visual-language-action model. In Proceedings of Robotics: Science and Systems (RSS), 2025

work page 2025

[71] [71]

Shared control templates for assistive robotics

Gabriel Quere, Annette Hagengruber, Maged Iskandar, Samuel Bustamante, Daniel Leidner, Freek Stulp, and J \"o rn Vogel. Shared control templates for assistive robotics. In 2020 IEEE international conference on robotics and automation (ICRA), pp.\ 1956--1962. IEEE, 2020

work page 2020

[72] [72]

Robot learning with sensorimotor pre-training

Ilija Radosavovic, Baifeng Shi, Letian Fu, Ken Goldberg, Trevor Darrell, and Jitendra Malik. Robot learning with sensorimotor pre-training. In Conference on Robot Learning, pp.\ 683--693. PMLR, 2023 a

work page 2023

[73] [73]

Real-world robot learning with masked visual pre-training

Ilija Radosavovic, Tete Xiao, Stephen James, Pieter Abbeel, Jitendra Malik, and Trevor Darrell. Real-world robot learning with masked visual pre-training. In Conference on Robot Learning, pp.\ 416--426. PMLR, 2023 b

work page 2023

[74] [74]

Latent plans for task-agnostic offline reinforcement learning

Erick Rosete-Beas, Oier Mees, Gabriel Kalweit, Joschka Boedecker, and Wolfram Burgard. Latent plans for task-agnostic offline reinforcement learning. In Conference on Robot Learning, pp.\ 1838--1849. PMLR, 2023

work page 2023

[75] [75]

Multi-resolution sensing for real-time control with vision-language models

Saumya Saxena, Mohit Sharma, and Oliver Kroemer. Multi-resolution sensing for real-time control with vision-language models. In 2nd Workshop on Language and Robot Learning: Language as Grounding, 2023

work page 2023

[76] [76]

Behavior transformers: Cloning k modes with one stone

Nur Muhammad Shafiullah, Zichen Cui, Ariuntuya Arty Altanzaya, and Lerrel Pinto. Behavior transformers: Cloning k modes with one stone. Advances in neural information processing systems, 35: 0 22955--22968, 2022

work page 2022

[77] [77]

On bringing robots home

Nur Muhammad Mahi Shafiullah, Anant Rai, Haritheja Etukuru, Yiqian Liu, Ishan Misra, Soumith Chintala, and Lerrel Pinto. On bringing robots home. arXiv preprint arXiv:2311.16098, 2023

work page arXiv 2023

[78] [78]

Rapid exploration for open-world navigation with latent goal models

Dhruv Shah, Benjamin Eysenbach, Gregory Kahn, Nicholas Rhinehart, and Sergey Levine. Rapid exploration for open-world navigation with latent goal models. In Proceedings of the Conference on Robot Learning (CoRL), 2021

work page 2021

[79] [79]

Mutex: Learning unified policies from multimodal task specifications

Rutav Shah, Roberto Mart \' n-Mart \' n, and Yuke Zhu. Mutex: Learning unified policies from multimodal task specifications. In Proceedings of the Conference on Robot Learning (CoRL), 2023

work page 2023

[80] [80]

Dense policy: Bidirectional autoregressive learning of actions

Yue Su, Xinyu Zhan, Hongjie Fang, Han Xue, Hao-Shu Fang, Yong-Lu Li, Cewu Lu, and Lixin Yang. Dense policy: Bidirectional autoregressive learning of actions. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025

work page 2025