arxiv: 2606.32009 · v1 · pith:3G6UONAUnew · submitted 2026-06-30 · 💻 cs.RO

Human-as-Humanoid: Enabling Zero-Shot Humanoid Learning from Ego-Exo Human Videos with Human-Aligned Embodiments

Pith reviewed 2026-07-01 04:59 UTC · model grok-4.3

classification 💻 cs.RO
keywords humanoid robot · vision-language-action · human-to-robot transfer · ego-exo video · inverse kinematics · demonstration scaling · bimanual manipulation

The pith

Human ego-exo videos convert into executable humanoid actions that train policies deployable on real robots without any target-task robot data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes a pipeline that turns synchronized ego and exo human videos into controller-aligned action labels for a 60-DoF upper-body humanoid. It recovers human motion, retargets it through staged inverse kinematics, and supervises training with forward kinematics to keep wrist and fingertip positions accurate. Policies trained only on these converted labels then run on the physical robot for multiple manipulation tasks. Readers would care because teleoperation data for high-DoF humanoids is slow to collect while human videos already exist at large scale. If the conversion preserves enough geometry, usable training data can grow several-fold (the paper reports a 4.8-7.2x raw demonstration-throughput gain over teleoperation) without new robot demonstrations for each task.

Core claim

Human-as-Humanoid converts ego-exo human videos into 60-DoF action chunks for the PrimeU humanoid by retargeting recovered motion through staged inverse kinematics and applying forward-kinematics-aware supervision, so that vision-language-action policies trained solely on the converted human labels generalize to real-robot deployment on downstream tasks without any target-task robot demonstrations.
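
To make the idea of staged inverse kinematics concrete, here is a minimal sketch on a toy planar chain. It is not the paper's 60-DoF solver: the three-link arm, the link lengths, the arm-then-finger ordering, and every function name are assumptions chosen for illustration. Stage 1 places the wrist with an analytic two-link solution; stage 2 holds the arm fixed and orients a single finger link toward the fingertip target.

```python
import math

def fk(q, links):
    """Planar forward kinematics: (x, y) of each link tip along the chain."""
    x = y = theta = 0.0
    points = []
    for angle, length in zip(q, links):
        theta += angle
        x += length * math.cos(theta)
        y += length * math.sin(theta)
        points.append((x, y))
    return points

def stage1_arm_ik(wrist, l1, l2):
    """Stage 1: analytic two-link IK for the wrist (positive-elbow branch)."""
    x, y = wrist
    c2 = (x * x + y * y - l1 * l1 - l2 * l2) / (2 * l1 * l2)
    c2 = max(-1.0, min(1.0, c2))  # clamp unreachable targets to the workspace edge
    q2 = math.acos(c2)
    q1 = math.atan2(y, x) - math.atan2(l2 * math.sin(q2), l1 + l2 * math.cos(q2))
    return q1, q2

def stage2_finger_ik(q1, q2, wrist, fingertip):
    """Stage 2: with the arm fixed, aim the finger link at the fingertip target."""
    direction = math.atan2(fingertip[1] - wrist[1], fingertip[0] - wrist[0])
    q3 = direction - (q1 + q2)
    return math.atan2(math.sin(q3), math.cos(q3))  # wrap into [-pi, pi]

def retarget(wrist, fingertip, links=(0.30, 0.25, 0.10)):
    """Run both stages; link lengths (metres) are hypothetical."""
    q1, q2 = stage1_arm_ik(wrist, links[0], links[1])
    reached_wrist = fk((q1, q2), links[:2])[-1]
    q3 = stage2_finger_ik(q1, q2, reached_wrist, fingertip)
    return (q1, q2, q3)

# Hypothetical human keypoints: a wrist target and a fingertip 10 cm beyond it.
wrist_target = (0.35, 0.20)
fingertip_target = (0.35 + 0.10 * math.cos(0.3), 0.20 + 0.10 * math.sin(0.3))
q = retarget(wrist_target, fingertip_target)
tips = fk(q, (0.30, 0.25, 0.10))
```

Solving the wrist first and the finger second keeps each stage small and lets the later stage absorb whatever residual the earlier one leaves, which is one plausible reason to stage the problem on a high-DoF hand-arm system.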

What carries the argument

Staged inverse kinematics retargeting with forward-kinematics-aware supervision on a human-aligned 60-DoF upper-body embodiment that produces controller-aligned action chunks from human motion.
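
The page does not give the form of the FK-aware loss, so the following is only a sketch of what such supervision could look like: a joint-space error plus a task-space error measured after pushing both predicted and target joints through forward kinematics. The planar chain, the link lengths, the weighting, and the names `fk_points` and `fk_aware_loss` are all assumptions.

```python
import numpy as np

LINKS = np.array([0.30, 0.25, 0.10])  # hypothetical link lengths in metres

def fk_points(q, links=LINKS):
    """Planar FK: positions of each link tip, shape (n_joints, 2)."""
    theta = np.cumsum(q)
    steps = np.stack([links * np.cos(theta), links * np.sin(theta)], axis=-1)
    return np.cumsum(steps, axis=0)

def fk_aware_loss(q_pred, q_target, task_weight=1.0):
    """Joint-space MSE plus squared task-space error at wrist and fingertip."""
    joint_term = np.mean((q_pred - q_target) ** 2)
    p_pred, p_target = fk_points(q_pred), fk_points(q_target)
    # Row -2 plays the role of the wrist, row -1 the fingertip.
    task_term = np.mean(np.sum((p_pred[-2:] - p_target[-2:]) ** 2, axis=-1))
    return joint_term + task_weight * task_term, joint_term, task_term

q_target = np.array([0.4, 0.6, 0.2])
q_shoulder_off = q_target + np.array([0.05, 0.0, 0.0])  # error at the base joint
q_finger_off = q_target + np.array([0.0, 0.0, 0.05])    # same-size error at the last joint

loss_a, joint_a, task_a = fk_aware_loss(q_shoulder_off, q_target)
loss_b, joint_b, task_b = fk_aware_loss(q_finger_off, q_target)
```

The two perturbed predictions have the same joint-space error, but the base-joint error displaces the wrist and fingertip far more than the distal one. A purely joint-space loss cannot tell them apart; a term computed through FK can, which is the property the review attributes to FK-aware supervision.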

Load-bearing premise

The retargeted action chunks keep task-space geometry close enough to the human demonstrations that policies succeed on the physical robot without further robot data for the target task.

What would settle it

A side-by-side test on one manipulation task in which success rate of the policy trained only on converted human labels falls significantly below the rate achieved by an otherwise identical policy trained on real teleoperated robot demonstrations for the same task.
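
One way to read "falls significantly below" is as a one-sided test on two success counts. The sketch below uses Fisher's exact test written against the standard library; the choice of test, the function name, and all trial counts are assumptions for illustration, not figures from the paper.

```python
from math import comb

def fisher_one_sided(success_human, n_human, success_teleop, n_teleop):
    """P-value for 'the human-label policy succeeds less often than the teleop policy'.

    Sums hypergeometric probabilities of seeing success_human or fewer
    successes in the human-label arm, given the pooled success total.
    """
    total = n_human + n_teleop
    successes = success_human + success_teleop
    denom = comb(total, n_human)
    lowest = max(0, successes - n_teleop)
    p = 0.0
    for k in range(lowest, success_human + 1):
        p += comb(successes, k) * comb(total - successes, n_human - k) / denom
    return p

# Hypothetical outcomes over 20 rollouts per policy (NOT results from the paper).
p_no_gap = fisher_one_sided(10, 20, 10, 20)    # identical success counts
p_large_gap = fisher_one_sided(5, 20, 18, 20)  # human-label policy far behind
```

A small p-value in the second case would count against the zero-shot claim for that task, while identical counts give no evidence of a gap. With only a few dozen rollouts per policy the test has limited power, so a non-significant result should not be read as proof of parity.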

Original abstract

Vision-language-action (VLA) models across robot embodiments require high-quality observation-action supervision to learn deployable action distributions, yet scaling such robot data remains difficult, especially for high-DoF humanoids. Teleoperation provides controller-aligned supervision, while human egocentric videos capture diverse bimanual manipulation but do not directly provide executable robot actions. We introduce Human-as-Humanoid, a human-to-humanoid supervision framework that enables near-real-time human-centric action generation, making human demonstrations usable for high-DoF humanoid VLA training by jointly aligning the robot embodiment, the sensing setup, and the action-label interface. Built on PrimeU, a human-aligned 60-DoF upper-body humanoid, Human-as-Humanoid uses synchronized ego-exo videos to pair deployment-aligned egocentric observations with exocentric motion recovery, retargets the recovered human motion through staged Inverse Kinematics (IK) into controller-aligned 60-DoF action chunks, and trains the VLA model with Forward Kinematics (FK)-aware supervision to preserve wrist and fingertip task-space geometry. This converts large-scale human demonstrations from visual observations into executable observation-action supervision for the target humanoid. Experiments validate the conversion chain at the motion-recovery, robot-action-space, and real-robot deployment levels. Human-as-Humanoid yields a 4.8-7.2x raw demonstration-throughput gain over humanoid teleoperation in our data-collection analysis, and on several downstream tasks, policies post-trained only with the converted human labels generalize to real-robot deployment without target-task robot demonstrations. The official project website is available at https://zgc-embodyai.github.io/Human-as-Humanoid.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces Human-as-Humanoid, a framework that converts synchronized ego-exo human videos into executable 60-DoF action chunks for the PrimeU humanoid via staged inverse kinematics retargeting and forward-kinematics-aware supervision. This produces observation-action pairs for VLA training that enable zero-shot policy deployment on real-robot tasks without target-task robot demonstrations. The work reports a 4.8-7.2x raw demonstration-throughput gain over teleoperation and claims validation of the conversion pipeline at motion-recovery, action-space, and deployment levels.

Significance. If the empirical results hold, the approach could meaningfully address data scaling challenges for high-DoF humanoid VLA models by repurposing large-scale human video corpora, with the human-aligned embodiment and throughput analysis as concrete strengths.

major comments (2)
  1. [§4] §4 (Experiments): The manuscript asserts validation of the conversion chain at motion-recovery, robot-action-space, and real-robot deployment levels plus zero-shot generalization without target-task robot data, yet reports no quantitative metrics (e.g., success rates, pose errors, or ablation tables) or description of how generalization was measured. This directly affects assessment of the central claim.
  2. [§3.2] §3.2 (Retargeting pipeline): The claim that staged IK plus FK-aware supervision produces action chunks whose task-space geometry is sufficiently close for zero-shot transfer rests on the unquantified assumption that wrist/fingertip errors remain below task thresholds; no error distribution or sensitivity analysis is provided to support this.
minor comments (1)
  1. [Abstract] The abstract mentions the project website but the main text does not cross-reference specific figures or tables that would allow readers to locate the supporting results.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the two major comments point-by-point below, providing clarifications and committing to additions that strengthen the quantitative support for our claims.

Point-by-point responses
  1. Referee: [§4] §4 (Experiments): The manuscript asserts validation of the conversion chain at motion-recovery, robot-action-space, and real-robot deployment levels plus zero-shot generalization without target-task robot data, yet reports no quantitative metrics (e.g., success rates, pose errors, or ablation tables) or description of how generalization was measured. This directly affects assessment of the central claim.

    Authors: We acknowledge that the current presentation emphasizes the throughput gain (4.8-7.2x) and qualitative deployment outcomes without tabulating per-task success rates, pose errors, or ablation studies. While the manuscript describes validation across the three levels and reports successful zero-shot real-robot deployment on downstream tasks, we agree that explicit quantitative metrics and a clear description of the generalization evaluation protocol would allow better assessment. In the revised manuscript we will add success-rate tables, pose-error statistics, and ablation results with details on how zero-shot generalization was measured. revision: yes

  2. Referee: [§3.2] §3.2 (Retargeting pipeline): The claim that staged IK plus FK-aware supervision produces action chunks whose task-space geometry is sufficiently close for zero-shot transfer rests on the unquantified assumption that wrist/fingertip errors remain below task thresholds; no error distribution or sensitivity analysis is provided to support this.

    Authors: We agree that the manuscript would be strengthened by quantifying the wrist/fingertip retargeting errors and demonstrating that they remain below task-relevant thresholds. The current text relies on the overall deployment success and FK-aware supervision design, but does not include error distributions or sensitivity analysis. We will incorporate these analyses (error histograms and sensitivity plots) in the revised version to directly support the zero-shot transfer assumption. revision: yes
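
The error distributions and sensitivity analysis discussed in this exchange could take roughly the following shape: per-frame wrist or fingertip position error, summary percentiles, and the fraction of frames that stay within a sweep of candidate task thresholds. The data here is synthetic, and the thresholds, noise scale, and function name are assumptions, not values from the manuscript.

```python
import numpy as np

def error_summary(pred_xyz, target_xyz, thresholds_m):
    """Per-frame Euclidean error, fraction within each threshold, and percentiles."""
    err = np.linalg.norm(pred_xyz - target_xyz, axis=-1)
    within = {float(t): float(np.mean(err <= t)) for t in thresholds_m}
    percentiles = {p: float(np.percentile(err, p)) for p in (50, 90, 99)}
    return err, within, percentiles

# Synthetic stand-in data (NOT results from the paper): 1000 frames of
# fingertip targets with 1 cm Gaussian noise per axis on the retargeted output.
rng = np.random.default_rng(0)
target = rng.uniform(-0.3, 0.3, size=(1000, 3))
pred = target + rng.normal(scale=0.01, size=(1000, 3))

thresholds = [0.005, 0.01, 0.02, 0.05]  # candidate task tolerances in metres
err, within, pct = error_summary(pred, target, thresholds)
```

Reading `within` across thresholds shows how quickly coverage degrades as the tolerance tightens, which is the sensitivity question the referee raises. A histogram of `err` would be the natural companion plot.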

Circularity Check

0 steps flagged

No significant circularity

Full rationale

The paper describes an empirical pipeline (ego-exo video capture, staged IK retargeting to 60-DoF PrimeU actions, FK-aware supervision) whose effectiveness is asserted via measured throughput gains and real-robot deployment results on downstream tasks. No derivation chain, equation, or uniqueness claim is present that reduces by construction to fitted inputs, self-citations, or renamed ansatzes; the central claims rest on experimental validation stages rather than self-referential definitions.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The central claim rests on unverified assumptions about the fidelity of video-based motion recovery and the accuracy of staged IK retargeting for preserving task geometry; these are domain assumptions rather than derived quantities.

axioms (2)
  • domain assumption: Synchronized ego-exo videos allow accurate recovery of human upper-body motion suitable for retargeting.
    The method begins with exocentric motion recovery from the paired videos.
  • domain assumption: Staged inverse kinematics followed by FK-aware supervision maps recovered human motion to robot actions without loss of deployable task performance.
    This mapping is the core conversion step that produces the training labels.
invented entities (1)
  • PrimeU 60-DoF upper-body humanoid (no independent evidence)
    purpose: Target embodiment whose controller-aligned action space receives the retargeted labels.
    The robot is introduced as the deployment platform; no independent evidence of its kinematics is supplied in the abstract.


