pith. sign in

arxiv: 2606.28133 · v1 · pith:3KG3QEOOnew · submitted 2026-06-26 · 💻 cs.RO · cs.CV

Translation as a Bridging Action: Transferring Manipulation Skills from Humans to Robots

Pith reviewed 2026-06-29 04:20 UTC · model grok-4.3

classification 💻 cs.RO cs.CV
keywords human-to-robot transfermanipulation skillsbridging actionwrist translationvision-language-action modelbi-manual robotparallel grippersaction representation
0
0 comments X

The pith

Relative wrist translation bridges human and robot manipulation skills more effectively than full 6DoF poses.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines how to transfer manipulation skills learned from human actions to a bi-manual robot equipped with parallel grippers. Human demonstrations are plentiful yet hard to use directly because 6DoF hand-pose estimates contain noise and finger contact patterns differ from gripper behavior. The authors replace full pose signals with a simpler shared representation: relative wrist translation expressed in the robot's initial head-camera frame. A vision-language-action model is adapted with interleaved action tokens and attention masking so that missing action components do not break training. On new bi-manual tasks this representation transfers knowledge more successfully than noisy 6DoF data and improves further when more human demonstrations are added.

Core claim

By representing actions as relative wrist translation within the initial head-camera frame, human manipulation data can be used directly to train policies for robots with parallel grippers, outperforming approaches that rely on noisy 6DoF human action estimates and scaling with the quantity of human data.

What carries the argument

The bridging action representation of relative wrist translation in the initial head-camera frame, realized through a π0-like vision-language-action model that uses interleaved action tokens and attention masking.

If this is right

  • Human demonstrations become directly usable for robot learning without embodiment-specific retargeting of rotations or contacts.
  • Full 6DoF human action signals are sub-optimal for transfer to parallel-gripper robots.
  • The method succeeds on a collection of previously unseen bi-manual manipulation tasks.
  • Task performance increases as the volume of available human data grows.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The result implies that many manipulation tasks are dominated by spatial translation intent rather than precise orientation or finger configuration.
  • A practical next step would be to test whether a small amount of learned rotation correction on top of the translation signal further improves transfer without reintroducing noise.
  • The same translation-only framing could be examined for single-arm or mobile manipulators where wrist position remains the dominant control variable.

Load-bearing premise

Relative wrist translation alone is sufficient to capture the manipulation skills needed for the target tasks despite differences in contact patterns and gripper geometry.

What would settle it

A controlled experiment in which models trained on identical volumes of human data show that 6DoF pose inputs produce higher success rates than the wrist-translation inputs on the same task suite would falsify the central claim.

read the original abstract

We study whether we can learn novel manipulation skills from human actions to a bi-manual robot with parallel grippers. Human action data is cheap, abundant, and diverse, making it one of the most promising resources for scaling up robot learning. Yet transferring skills from humans to robots remains hard: most prior work treats humans as just another bi-manual 6DoF embodiment, where hand-pose estimates are noisy and the contact patterns of human fingers differ fundamentally from those of a parallel gripper. We argue that learning rotation-inclusive action signals from human data is therefore sub-optimal, and instead propose a bridging action representation: the relative wrist translation within the initial head-camera frame, an action space shared by humans and robots. To handle the potential absence of certain action components in different embodiments, we build a $\pi_0$-like vision-language-action model with interleaved action tokens and attention masking. On a suite of novel bi-manual manipulation tasks, our bridging action transfers human manipulation knowledge to robots far more effectively than noisy 6DoF human actions and scales with the amount of human data.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes relative wrist translation in the initial head-camera frame as a bridging action space to transfer manipulation skills from human demonstrations to a bi-manual robot equipped with parallel grippers. A vision-language-action model with interleaved action tokens and attention masking is used to handle missing action components across embodiments. The central claim is that this bridging action transfers human knowledge far more effectively than noisy 6DoF human actions and scales with the volume of human data, evaluated on a suite of novel bi-manual manipulation tasks.

Significance. If the empirical results hold, the work would be significant for scalable robot learning: human data is abundant and cheap, and a shared translational action space that sidesteps noisy 6DoF pose estimation and finger-gripper contact mismatches could enable more reliable imitation from human sources. The attention-masking mechanism for partial action spaces is a practical engineering choice that may generalize to other embodiment gaps.

major comments (2)
  1. [Abstract] Abstract: the central empirical claim that the bridging action 'transfers human manipulation knowledge to robots far more effectively than noisy 6DoF human actions and scales with the amount of human data' is stated without any quantitative results, baselines, error bars, dataset sizes, or task definitions, preventing assessment of evidence strength.
  2. [Bridging action representation] Bridging action representation: dropping rotation and finger DOFs to avoid noise, then relying on relative wrist translation plus vision, assumes target tasks are dominated by translational motion recoverable from images. For rotation- or grasp-geometry-dependent tasks (e.g., key insertion or handle turning), omitted signals cannot be recovered, rendering reported superiority over 6DoF baselines potentially task-dependent rather than general. This assumption is load-bearing for the transfer claims.
minor comments (1)
  1. [Model architecture] The reference to a 'π0-like' vision-language-action model should include an explicit citation to the original π0 paper.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below, indicating revisions where appropriate.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the central empirical claim that the bridging action 'transfers human manipulation knowledge to robots far more effectively than noisy 6DoF human actions and scales with the amount of human data' is stated without any quantitative results, baselines, error bars, dataset sizes, or task definitions, preventing assessment of evidence strength.

    Authors: We agree that the abstract would be strengthened by including indicative quantitative details. In the revised manuscript we will incorporate concise references to key results (e.g., success-rate deltas versus the 6DoF baseline and the scale of human data used) while remaining within abstract length limits. revision: yes

  2. Referee: [Bridging action representation] Bridging action representation: dropping rotation and finger DOFs to avoid noise, then relying on relative wrist translation plus vision, assumes target tasks are dominated by translational motion recoverable from images. For rotation- or grasp-geometry-dependent tasks (e.g., key insertion or handle turning), omitted signals cannot be recovered, rendering reported superiority over 6DoF baselines potentially task-dependent rather than general. This assumption is load-bearing for the transfer claims.

    Authors: The referee correctly identifies a scope limitation. Our central argument is that noisy 6DoF human pose estimates and mismatched finger-gripper contacts make direct rotation-inclusive imitation suboptimal; the relative wrist translation in the initial head-camera frame supplies a clean, embodiment-shared signal while the interleaved VLA model uses visual observations to recover orientation and grasp details. The reported gains hold on the evaluated bi-manual suite, which emphasizes translational components. We will add an explicit limitations paragraph discussing rotation-heavy tasks (e.g., key insertion) and note that the bridging action is not claimed to be universal. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical method with no derivations or self-referential predictions

full rationale

The paper advances an empirical claim that relative wrist translation as a bridging action outperforms noisy 6DoF human actions for transferring manipulation skills, evaluated on novel bi-manual tasks with a VLA model. No equations, first-principles derivations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the abstract or described method. The action-space choice is presented as a design decision to avoid noise, not derived from prior results by the same authors. The central result is an experimental comparison that does not reduce to its inputs by construction; external benchmarks (task success rates) remain independent. This is the expected non-finding for a purely empirical robotics paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated. The approach implicitly assumes that wrist translation is a sufficient shared action space and that attention masking can compensate for embodiment gaps without additional learned components.

pith-pipeline@v0.9.1-grok · 5750 in / 1038 out tokens · 27620 ms · 2026-06-29T04:20:03.139057+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

69 extracted references · 45 canonical work pages · 31 internal anchors

  1. [1]

    Human-to-robot imitation in the wild.arXiv preprint arXiv:2207.09450, 2022

    Shikhar Bahl, Abhinav Gupta, and Deepak Pathak. Human-to-robot imitation in the wild.arXiv preprint arXiv:2207.09450, 2022

  2. [2]

    Affordances from human videos as a versatile representation for robotics

    Shikhar Bahl, Russell Mendonca, Lili Chen, Unnat Jain, and Deepak Pathak. Affordances from human videos as a versatile representation for robotics. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13778–13790, 2023

  3. [3]

    Qwen2.5-VL Technical Report

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-vl technical report, 2...

  4. [4]

    A Careful Examination of Large Behavior Models for Multitask Dexterous Manipulation

    Jose Barreiros, Andrew Beaulieu, Aditya Bhat, Rick Cory, Eric Cousineau, Hongkai Dai, Ching-Hsin Fang, Kunimatsu Hashimoto, Muhammad Zubair Irshad, Masha Itkina, et al. A careful examination of large behavior models for multitask dexterous manipulation.arXiv preprint arXiv:2507.05331, 2025

  5. [5]

    Track2act: Predicting point tracks from internet videos enables generalizable robot manipulation

    Homanga Bharadhwaj, Roozbeh Mottaghi, Abhinav Gupta, and Shubham Tulsiani. Track2act: Predicting point tracks from internet videos enables generalizable robot manipulation. InEuropean Conference on Computer Vision, pages 306–324. Springer, 2024

  6. [6]

    H-rdt: Human manipulation enhanced bimanual robotic manipulation

    Hongzhe Bi, Lingxuan Wu, Tianwei Lin, Hengkai Tan, Zhizhong Su, Hang Su, and Jun Zhu. H-rdt: Human manipulation enhanced bimanual robotic manipulation. InProceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 18135–18143, 2026

  7. [7]

    GR00T N1: An Open Foundation Model for Generalist Humanoid Robots

    Johan Bjorck, Fernando Castañeda, Nikita Cherniadev, Xingye Da, Runyu Ding, Linxi Fan, Yu Fang, Dieter Fox, Fengyuan Hu, Spencer Huang, et al. Gr00t n1: An open foundation model for generalist humanoid robots.arXiv preprint arXiv:2503.14734, 2025

  8. [8]

    $\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

    Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al.π0: A vision-language-action flow model for general robot control. arXiv preprint arXiv:2410.24164, 2024

  9. [9]

    Genie: Generative interactive environments

    Jake Bruce, Michael D Dennis, Ashley Edwards, Jack Parker-Holder, Yuge Shi, Edward Hughes, Matthew Lai, Aditi Mavalankar, Richie Steigerwald, Chris Apps, et al. Genie: Generative interactive environments. InForty-first International Conference on Machine Learning, 2024

  10. [10]

    AgiBot World Colosseo: A Large-scale Manipulation Platform for Scalable and Intelligent Embodied Systems

    Qingwen Bu, Jisong Cai, Li Chen, Xiuqi Cui, Yan Ding, Siyuan Feng, Shenyuan Gao, Xindong He, Xuan Hu, Xu Huang, et al. Agibot world colosseo: A large-scale manipulation platform for scalable and intelligent embodied systems. arXiv preprint arXiv:2503.06669, 2025

  11. [11]

    In-n-on: Scaling egocentric manipulation with in-the-wild and on-task data.arXiv preprint arXiv:2511.15704, 2025

    Xiongyi Cai, Ri-Zhao Qiu, Geng Chen, Lai Wei, Isabella Liu, Tianshu Huang, Xuxin Cheng, and Xiaolong Wang. In-n-on: Scaling egocentric manipulation with in-the-wild and on-task data.arXiv preprint arXiv:2511.15704, 2025

  12. [12]

    GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation

    Chi-Lam Cheang, Guangzeng Chen, Ya Jing, Tao Kong, Hang Li, Yifeng Li, Yuxiao Liu, Hongtao Wu, Jiafeng Xu, Yichu Yang, et al. Gr-2: A generative video-language-action model with web-scale knowledge for robot manipulation. arXiv preprint arXiv:2410.06158, 2024

  13. [13]

    GR-3 Technical Report

    Chilam Cheang, Sijin Chen, Zhongren Cui, Yingdong Hu, Liqun Huang, Tao Kong, Hang Li, Yifeng Li, Yuxiao Liu, Xiao Ma, et al. Gr-3 technical report.arXiv preprint arXiv:2507.15493, 2025

  14. [14]

    Microsoft COCO Captions: Data Collection and Evaluation Server

    Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco captions: Data collection and evaluation server.arXiv preprint arXiv:1504.00325, 2015

  15. [15]

    Moto: Latent motion token as the bridging language for learning robot manipulation from videos

    Yi Chen, Yuying Ge, Weiliang Tang, Yizhuo Li, Yixiao Ge, Mingyu Ding, Ying Shan, and Xihui Liu. Moto: Latent motion token as the bridging language for learning robot manipulation from videos. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 19752–19763, 2025

  16. [16]

    Scaling egocentric vision: The epic-kitchens dataset

    Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Sanja Fidler, Antonino Furnari, Evangelos Kazakos, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, et al. Scaling egocentric vision: The epic-kitchens dataset. In Proceedings of the European conference on computer vision (ECCV), pages 720–736, 2018. 15

  17. [17]

    XR-1: Towards Versatile Vision-Language-Action Models via Learning Unified Vision-Motion Representations

    Shichao Fan, Kun Wu, Zhengping Che, Xinhua Wang, Di Wu, Fei Liao, Ning Liu, Yixue Zhang, Zhen Zhao, Zhiyuan Xu, et al. Xr-1: Towards versatile vision-language-action models via learning unified vision-motion representations. arXiv preprint arXiv:2511.02776, 2025

  18. [18]

    something something

    Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michalski, Joanna Materzynska, Susanne Westphal, Heuna Kim, Valentin Haenel, Ingo Fruend, Peter Yianilos, Moritz Mueller-Freitag, et al. The" something something" video database for learning and evaluating visual common sense. InProceedings of the IEEE international conference on computer vision, pages 5842–5850, 2017

  19. [19]

    Making the v in vqa matter: Elevating the role of image understanding in visual question answering

    Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 6904–6913, 2017

  20. [20]

    Ego4d: Around the world in 3,000 hours of egocentric video

    Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, et al. Ego4d: Around the world in 3,000 hours of egocentric video. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 18995–19012, 2022

  21. [21]

    Ego-exo4d: Understanding skilled human activity from first-and third-person perspectives

    Kristen Grauman, Andrew Westbury, Lorenzo Torresani, Kris Kitani, Jitendra Malik, Triantafyllos Afouras, Kumar Ashutosh, Vijay Baiyya, Siddhant Bansal, Bikram Boote, et al. Ego-exo4d: Understanding skilled human activity from first-and third-person perspectives. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 193...

  22. [22]

    EgoDex: Learning Dexterous Manipulation from Large-Scale Egocentric Video

    Ryan Hoque, Peide Huang, David J Yoon, Mouli Sivapurapu, and Jian Zhang. Egodex: Learning dexterous manipulation from large-scale egocentric video.arXiv preprint arXiv:2505.11709, 2025

  23. [23]

    RoboMIND 2.0: A multimodal, bimanual mobile ma- nipulation dataset for generalizable embodied intelligence.arXiv preprint arXiv:2512.24653, 2025

    Chengkai Hou, Kun Wu, Jiaming Liu, Zhengping Che, Di Wu, Fei Liao, Guangrun Li, Jingyang He, Qiuxuan Feng, Zhao Jin, et al. Robomind 2.0: A multimodal, bimanual mobile manipulation dataset for generalizable embodied intelligence. arXiv preprint arXiv:2512.24653, 2025

  24. [24]

    Video Prediction Policy: A Generalist Robot Policy with Predictive Visual Representations

    Yucheng Hu, Yanjiang Guo, Pengchao Wang, Xiaoyu Chen, Yen-Jen Wang, Jianke Zhang, Koushil Sreenath, Chaochao Lu, and Jianyu Chen. Video prediction policy: A generalist robot policy with predictive visual representations. arXiv preprint arXiv:2412.14803, 2024

  25. [25]

    $\pi_{0.5}$: a Vision-Language-Action Model with Open-World Generalization

    Physical Intelligence, Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, et al.π0.5: a vision-language-action model with open-world generalization. arXiv preprint arXiv:2504.16054, 2025

  26. [26]

    Egomimic: Scaling imitation learning via egocentric video

    Simar Kareer, Dhruv Patel, Ryan Punamiya, Pranay Mathur, Shuo Cheng, Chen Wang, Judy Hoffman, and Danfei Xu. Egomimic: Scaling imitation learning via egocentric video. In2025 IEEE International Conference on Robotics and Automation (ICRA), pages 13226–13233. IEEE, 2025

  27. [27]

    Emergence of human to robot transfer in vision-language-action models.arXiv preprint arXiv:2512.22414, 2025

    Simar Kareer, Karl Pertsch, James Darpinian, Judy Hoffman, Danfei Xu, Sergey Levine, Chelsea Finn, and Suraj Nair. Emergence of human to robot transfer in vision-language-action models.arXiv preprint arXiv:2512.22414, 2025

  28. [28]

    OpenVLA: An Open-Source Vision-Language-Action Model

    Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. Openvla: An open-source vision-language-action model.arXiv preprint arXiv:2406.09246, 2024

  29. [29]

    LLaVA-OneVision: Easy Visual Task Transfer

    Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, et al. Llava-onevision: Easy visual task transfer.arXiv preprint arXiv:2408.03326, 2024

  30. [30]

    Causal World Modeling for Robot Control

    Lin Li, Qihang Zhang, Yiming Luo, Shuai Yang, Ruilin Wang, Fei Han, Mingrui Yu, Zelin Gao, Nan Xue, Xing Zhu, et al. Causal world modeling for robot control.arXiv preprint arXiv:2601.21998, 2026

  31. [31]

    Autoregressive image generation without vector quantization.Advancesin Neural Information Processing Systems, 37:56424–56445, 2024

    Tianhong Li, Yonglong Tian, He Li, Mingyang Deng, and Kaiming He. Autoregressive image generation without vector quantization.Advancesin Neural Information Processing Systems, 37:56424–56445, 2024

  32. [32]

    Mixture-of-Transformers: A Sparse and Scalable Architecture for Multi-Modal Foundation Models

    Weixin Liang, Lili Yu, Liang Luo, Srinivasan Iyer, Ning Dong, Chunting Zhou, Gargi Ghosh, Mike Lewis, Wen-tau Yih, Luke Zettlemoyer, et al. Mixture-of-transformers: A sparse and scalable architecture for multi-modal foundation models. arXiv preprint arXiv:2411.04996, 2024

  33. [33]

    A systematic study of data modalities and strategies for co-training large behavior models for robot manipulation.arXiv preprint arXiv:2602.01067, 2026

    Fanqi Lin, Kushal Arora, Jean Mercat, Haruki Nishimura, Paarth Shah, Chen Xu, Mengchao Zhang, Mark Zolotas, Maya Angeles, Owen Pfannenstiehl, et al. A systematic study of data modalities and strategies for co-training large behavior models for robot manipulation.arXiv preprint arXiv:2602.01067, 2026. 16

  34. [34]

    Flow Matching for Generative Modeling

    Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747, 2022

  35. [35]

    RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation

    Songming Liu, Lingxuan Wu, Bangguo Li, Hengkai Tan, Huayu Chen, Zhengyi Wang, Ke Xu, Hang Su, and Jun Zhu. Rdt-1b: a diffusion foundation model for bimanual manipulation.arXiv preprint arXiv:2410.07864, 2024

  36. [36]

    Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow

    Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow.arXiv preprint arXiv:2209.03003, 2022

  37. [37]

    Being-h0: vision-language-action pretraining from large-scale human videos.arXiv preprint arXiv:2507.15597, 2025

    Hao Luo, Yicheng Feng, Wanpeng Zhang, Sipeng Zheng, Ye Wang, Haoqi Yuan, Jiazheng Liu, Chaoyi Xu, Qin Jin, and Zongqing Lu. Being-h0: vision-language-action pretraining from large-scale human videos.arXiv preprint arXiv:2507.15597, 2025

  38. [38]

    Being-h0

    Hao Luo, Ye Wang, Wanpeng Zhang, Sipeng Zheng, Ziheng Xi, Chaoyi Xu, Haiweng Xu, Haoqi Yuan, Chi Zhang, Yiqing Wang, et al. Being-h0. 5: Scaling human-centric robot learning for cross-embodiment generalization.arXiv preprint arXiv:2601.12993, 2026

  39. [39]

    Howto100m: Learning a text-video embedding by watching hundred million narrated video clips

    Antoine Miech, Dimitri Zhukov, Jean-Baptiste Alayrac, Makarand Tapaswi, Ivan Laptev, and Josef Sivic. Howto100m: Learning a text-video embedding by watching hundred million narrated video clips. InProceedings of the IEEE/CVF international conference on computer vision, pages 2630–2640, 2019

  40. [40]

    Open x-embodiment: Robotic learning datasets and rt-x models: Open x-embodiment collaboration 0

    Abby O’Neill, Abdul Rehman, Abhiram Maddukuri, Abhishek Gupta, Abhishek Padalkar, Abraham Lee, Acorn Pooley, Agrim Gupta, Ajay Mandlekar, Ajinkya Jain, et al. Open x-embodiment: Robotic learning datasets and rt-x models: Open x-embodiment collaboration 0. In2024 IEEE International Conference on Robotics and Automation (ICRA), pages 6892–6903. IEEE, 2024

  41. [41]

    Reconstructing hands in 3D with transformers

    Georgios Pavlakos, Dandan Shan, Ilija Radosavovic, Angjoo Kanazawa, David Fouhey, and Jitendra Malik. Reconstructing hands in 3D with transformers. InCVPR, 2024

  42. [42]

    FAST: Efficient Action Tokenization for Vision-Language-Action Models

    Karl Pertsch, Kyle Stachowicz, Brian Ichter, Danny Driess, Suraj Nair, Quan Vuong, Oier Mees, Chelsea Finn, and Sergey Levine. Fast: Efficient action tokenization for vision-language-action models.arXiv preprint arXiv:2501.09747, 2025

  43. [43]

    Egobridge: Domain adaptation for generalizable imitation from egocentric human data

    Ryan Punamiya, Dhruv Patel, Patcharapong Aphiwetsa, Pranav Kuppili, Lawrence Y Zhu, Simar Kareer, Judy Hoffman, and Danfei Xu. Egobridge: Domain adaptation for generalizable imitation from egocentric human data. In Human to Robot: Workshopon Sensorizing, Modeling, and Learning from Humans, 2025

  44. [44]

    Humanoid policy˜human policy.arXiv preprint arXiv:2503.13441, 2025

    Ri-Zhao Qiu, Shiqi Yang, Xuxin Cheng, Chaitanya Chawla, Jialong Li, Tairan He, Ge Yan, David J Yoon, Ryan Hoque, Lars Paulsen, et al. Humanoid policy˜human policy.arXiv preprint arXiv:2503.13441, 2025

  45. [45]

    Romero, D

    Javier Romero, Dimitrios Tzionas, and Michael J Black. Embodied hands: Modeling and capturing hands and bodies together. arXiv preprint arXiv:2201.02610, 2022

  46. [46]

    EgoHumanoid: Unlocking In-the-Wild Loco-Manipulation with Robot-Free Egocentric Demonstration

    Modi Shi, Shijia Peng, Jin Chen, Haoran Jiang, Yinghui Li, Di Huang, Ping Luo, Hongyang Li, and Li Chen. Egohumanoid: Unlocking in-the-wild loco-manipulation with robot-free egocentric demonstration.arXiv preprint arXiv:2602.10106, 2026

  47. [47]

    Dexwild: Dexterous human interactions for in-the-wild robot policies.Robotics: Science and Systems (RSS), 2025

    Tony Tao, Mohan Kumar Srirama, Jason Jingzhou Liu, Kenneth Shaw, and Deepak Pathak. Dexwild: Dexterous human interactions for in-the-wild robot policies.Robotics: Science and Systems (RSS), 2025

  48. [48]

    Gen-0: Embodied foundation models that scale with physical interaction.Generalist AI Blog, 2025

    Generalist AI Team. Gen-0: Embodied foundation models that scale with physical interaction.Generalist AI Blog, 2025. https://generalistai.com/blog/nov-04-2025-GEN-0

  49. [49]

    Gen-1: Scaling embodied foundation models to mastery

    Generalist AI Team. Gen-1: Scaling embodied foundation models to mastery. Generalist AI Blog, 2026. https://generalistai.com/blog/apr-02-2026-GEN-1

  50. [50]

    Octo: An Open-Source Generalist Robot Policy

    Octo Model Team, Dibya Ghosh, Homer Walke, Karl Pertsch, Kevin Black, Oier Mees, Sudeep Dasari, Joey Hejna, Tobias Kreiman, Charles Xu, et al. Octo: An open-source generalist robot policy.arXiv preprintarXiv:2405.12213, 2024

  51. [51]

    Attention is all you need.Advancesin neural information processing systems, 30, 2017

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need.Advancesin neural information processing systems, 30, 2017

  52. [52]

    Mimicplay: Long-horizon imitation learning by watching human play.arXiv preprint arXiv:2302.12422, 2023

    Chen Wang, Linxi Fan, Jiankai Sun, Ruohan Zhang, Li Fei-Fei, Danfei Xu, Yuke Zhu, and Anima Anandkumar. Mimicplay: Long-horizon imitation learning by watching human play.arXiv preprint arXiv:2302.12422, 2023. 17

  53. [53]

    Scaling proprioceptive-visual learning with heterogeneous pre-trained transformers

    Lirui Wang, Xinlei Chen, Jialiang Zhao, and Kaiming He. Scaling proprioceptive-visual learning with heterogeneous pre-trained transformers. Advancesin neural information processing systems, 37:124420–124450, 2024

  54. [54]

    One Hand to Rule Them All: Canonical Representations for Unified Dexterous Manipulation

    Zhenyu Wei, Yunchao Yao, and Mingyu Ding. One hand to rule them all: Canonical representations for unified dexterous manipulation. arXiv preprint arXiv:2602.16712, 2026

  55. [55]

    Any-point Trajectory Modeling for Policy Learning

    Chuan Wen, Xingyu Lin, John So, Kai Chen, Qi Dou, Yang Gao, and Pieter Abbeel. Any-point trajectory modeling for policy learning.arXiv preprint arXiv:2401.00025, 2023

  56. [56]

    Unleashing Large-Scale Video Generative Pre-training for Visual Robot Manipulation

    Hongtao Wu, Ya Jing, Chilam Cheang, Guangzeng Chen, Jiafeng Xu, Xinghang Li, Minghuan Liu, Hang Li, and Tao Kong. Unleashing large-scale video generative pre-training for visual robot manipulation.arXiv preprint arXiv:2312.13139, 2023

  57. [57]

    RoboMIND: Benchmark on Multi-embodiment Intelligence Normative Data for Robot Manipulation

    Kun Wu, Chengkai Hou, Jiaming Liu, Zhengping Che, Xiaozhu Ju, Zhuqin Yang, Meng Li, Yinuo Zhao, Zhiyuan Xu, Guang Yang, et al. Robomind: Benchmark on multi-embodiment intelligence normative data for robot manipulation. arXiv preprint arXiv:2412.13877, 2024

  58. [58]

    RoboCOIN: An Open-Sourced Bimanual Robotic Data Collection for Integrated Manipulation

    Shihan Wu, Xuecheng Liu, Shaoxuan Xie, Pengwei Wang, Xinghang Li, Bowen Yang, Zhe Li, Kai Zhu, Hongyu Wu, Yiheng Liu, et al. Robocoin: An open-sourced bimanual robotic data collection for integrated manipulation. arXiv preprint arXiv:2511.17441, 2025

  59. [59]

    A Pragmatic VLA Foundation Model

    Wei Wu, Fan Lu, Yunnan Wang, Shuai Yang, Shi Liu, Fangjing Wang, Qian Zhu, He Sun, Yong Wang, Shuailei Ma, et al. A pragmatic vla foundation model.arXiv preprint arXiv:2601.18692, 2026

  60. [60]

    Dexumi: Using human hand as the universal manipulation interface for dexterous manipulation.arXiv preprint arXiv:2505.21864, 2025

    Mengda Xu, Han Zhang, Yifan Hou, Zhenjia Xu, Linxi Fan, Manuela Veloso, and Shuran Song. Dexumi: Using human hand as the universal manipulation interface for dexterous manipulation.arXiv preprint arXiv:2505.21864, 2025

  61. [61]

    Pushing the limits of cross-embodiment learning for manipulation and navigation.arXiv preprint arXiv:2402.19432, 2024

    Jonathan Yang, Catherine Glossop, Arjun Bhorkar, Dhruv Shah, Quan Vuong, Chelsea Finn, Dorsa Sadigh, and Sergey Levine. Pushing the limits of cross-embodiment learning for manipulation and navigation.arXiv preprint arXiv:2402.19432, 2024

  62. [62]

    EgoVLA: Learning Vision-Language-Action Models from Egocentric Human Videos

    Ruihan Yang, Qinxi Yu, Yecheng Wu, Rui Yan, Borui Li, An-Chieh Cheng, Xueyan Zou, Yunhao Fang, Xuxin Cheng, Ri-Zhao Qiu, et al. Egovla: Learning vision-language-action models from egocentric human videos.arXiv preprint arXiv:2507.12440, 2025

  63. [63]

    Latent Action Pretraining from Videos

    Seonghyeon Ye, Joel Jang, Byeongguk Jeon, Sejune Joo, Jianwei Yang, Baolin Peng, Ajay Mandlekar, Reuben Tan, Yu-Wei Chao, Bill Yuchen Lin, et al. Latent action pretraining from videos.arXiv preprint arXiv:2410.11758, 2024

  64. [64]

    World Action Models are Zero-shot Policies

    Seonghyeon Ye, Yunhao Ge, Kaiyuan Zheng, Shenyuan Gao, Sihyun Yu, George Kurian, Suneel Indupuru, You Liang Tan, Chuning Zhu, Jiannan Xiang, et al. World action models are zero-shot policies.arXiv preprint arXiv:2602.15922, 2026

  65. [65]

    Motiontrans: Human vr data enable motion-level learning for robotic manipulation policies.arXiv preprint arXiv:2509.17759, 2025

    Chengbo Yuan, Rui Zhou, Mengzhen Liu, Yingdong Hu, Shengjie Wang, Li Yi, Chuan Wen, Shanghang Zhang, and Yang Gao. Motiontrans: Human vr data enable motion-level learning for robotic manipulation policies.arXiv preprint arXiv:2509.17759, 2025

  66. [66]

    Egoscale: Scaling dexterous manipulation with diverse egocentric human data

    Ruijie Zheng, Dantong Niu, Yuqi Xie, Jing Wang, Mengda Xu, Yunfan Jiang, Fernando Castañeda, Fengyuan Hu, You Liang Tan, Letian Fu, et al. Egoscale: Scaling dexterous manipulation with diverse egocentric human data. arXiv preprint arXiv:2602.16710, 2026

  67. [67]

    Towards automatic learning of procedures from web instructional videos

    Luowei Zhou, Chenliang Xu, and Jason Corso. Towards automatic learning of procedures from web instructional videos. In Proceedings of the AAAI conference on artificial intelligence, volume 32, 2018

  68. [68]

    Emma: Scaling mobile manipulation via egocentric human data

    Lawrence Y Zhu, Pranav Kuppili, Ryan Punamiya, Patcharapong Aphiwetsa, Dhruv Patel, Simar Kareer, Sehoon Ha, and Danfei Xu. Emma: Scaling mobile manipulation via egocentric human data. IEEE Robotics and Automation Letters, 2026

  69. [69]

    Rt-2: Vision-language-action models transfer web knowledge to robotic control

    Brianna Zitkovich, Tianhe Yu, Sichun Xu, Peng Xu, Ted Xiao, Fei Xia, Jialin Wu, Paul Wohlhart, Stefan Welker, Ayzaan Wahid, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. In Conference on Robot Learning, pages 2165–2183. PMLR, 2023. 18