arxiv: 2606.06627 · v1 · pith:34GEWUANnew · submitted 2026-06-04 · 💻 cs.RO · cs.AI · cs.CV · cs.LG

What Matters When Cotraining Robot Manipulation Policies on Everyday Human Videos?

Pith reviewed 2026-06-28 01:01 UTC · model grok-4.3

classification 💻 cs.RO · cs.AI · cs.CV · cs.LG
keywords robot manipulation · cotraining · human video · embodiment specialization · motion gap · policy learning · everyday videos

The pith

Specializing the vision and policy networks to each embodiment bridges the motion gap and lets cotraining on everyday human videos raise robot manipulation success rates by an absolute 29.7 percentage points in the low-robot-data regime.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests what allows robot policies to learn from plentiful everyday human videos instead of curated demonstrations that already look like robot motions. Using a new set of 532 videos with precise triangulated hand labels and natural movements, the authors show that accurate hand pose helps but is not enough on its own. The decisive step is to let the vision and policy networks specialize separately to human and robot embodiments so they can handle the remaining differences in how people and robots move. This recipe produces steady gains across six tasks, most noticeably when the robot has little of its own data.

Core claim

Even with high-quality hand labels from natural everyday videos, the motion gap hinders transfer to robot policies; the vision and policy networks must specialize to each embodiment before cotraining yields reliable improvement. With that step, the recipe delivers an absolute success-rate gain of 29.7 percentage points in the low-robot-data regime across six manipulation tasks.

What carries the argument

Specialization of the vision and policy networks to human versus robot embodiments, which lets the model absorb shared visual and task knowledge while routing embodiment-specific motion patterns through separate pathways.
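
One way to picture this routing is a model with a shared trunk and one output head per embodiment. The sketch below is illustrative only: the class name `EmbodimentRoutedPolicy`, the layer sizes, and the choice of which layers are shared versus separate are assumptions, not the architecture reported in the paper.

```python
import numpy as np


class EmbodimentRoutedPolicy:
    """Toy policy: shared feature trunk, one action head per embodiment.

    Hypothetical layout for illustration; the paper's actual split between
    shared and embodiment-specific layers is not described on this page.
    """

    def __init__(self, obs_dim, hidden_dim, action_dim,
                 embodiments=("human", "robot"), seed=0):
        rng = np.random.default_rng(seed)
        # Shared weights: updated by batches from every embodiment.
        self.trunk = rng.normal(scale=0.1, size=(obs_dim, hidden_dim))
        # Separate weights: each head only ever sees its own embodiment.
        self.heads = {
            name: rng.normal(scale=0.1, size=(hidden_dim, action_dim))
            for name in embodiments
        }

    def forward(self, obs, embodiment):
        features = np.tanh(obs @ self.trunk)       # shared pathway
        return features @ self.heads[embodiment]   # embodiment-specific pathway


policy = EmbodimentRoutedPolicy(obs_dim=8, hidden_dim=16, action_dim=7)
obs = np.ones((2, 8))
human_actions = policy.forward(obs, "human")
robot_actions = policy.forward(obs, "robot")
```

The same observation produces different actions depending on the embodiment label, while the trunk parameters remain common to both data sources.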

If this is right

  • The method delivers consistent gains on six different manipulation tasks.
  • The largest benefits appear when the amount of robot data is small.
  • Everyday human videos with accurate hand labels become usable for robot learning once the specialization step is added.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Larger collections of unlabeled everyday video could be mixed in to further cut the number of robot demonstrations needed.
  • The same specialization pattern might be tested on navigation or mobile manipulation where embodiment differences are also large.
  • An automatic test for when specialization helps versus when joint training suffices could reduce manual tuning.

Load-bearing premise

The motion gap between natural human actions and robot behavior can be overcome by letting the vision and policy networks specialize to each embodiment.

What would settle it

Running the same six tasks with the cotraining recipe but without network specialization and finding zero or negative change in success rate would falsify the claim that specialization is required for the reported gains.
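
The comparison described here reduces to a difference of mean success rates between two conditions. A minimal sketch follows, assuming per-task results are available as (successes, trials) pairs; the function names and the numbers are placeholders for illustration, not results from the paper.

```python
def mean_success_rate(per_task):
    """Mean of per-task success rates, in percent.

    per_task: list of (successes, trials) tuples, one per task.
    """
    if not per_task:
        raise ValueError("need at least one task")
    rates = [100.0 * successes / trials for successes, trials in per_task]
    return sum(rates) / len(rates)


def absolute_gain(baseline, treatment):
    """Absolute gain in percentage points (treatment minus baseline)."""
    return mean_success_rate(treatment) - mean_success_rate(baseline)


# Placeholder counts, NOT measurements from the paper.
robot_only = [(5, 20), (10, 20)]
cotrained_no_specialization = [(10, 20), (15, 20)]

gain = absolute_gain(robot_only, cotrained_no_specialization)
claim_challenged = gain <= 0.0  # zero or negative change would count against the claim
```

Reporting the gain in percentage points keeps it distinct from a relative improvement, which would divide by the baseline rate.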

Figures

Figures reproduced from arXiv: 2606.06627 by Aditya Prakash, Andrew Wen, Pulkit Agrawal, Richard Li, Saurabh Gupta, Yilun Du.

Figure 1: Top: System diagram showing data processing and policy cotraining steps. Bottom: Rollouts from cotrained policy manipulating unseen objects in unseen scenes. See interactive visualizations and videos: https://richardrl.github.io/what-matters-cotraining-human-videos/. To enable controlled study, we construct a dataset of everyday human videos with high-quality 3D hand poses by triangulating EgoExo4D [6] wi…
Figure 2: Input-output diagram for inference over human and robot data in our conditional flow…
Figure 3: Robot TCP frame vs. human TCP frame (not to scale). For the network to learn shared representations across embodiments, the human and robot action marginals must share the same support. We map the human 3D hand pose to the robot action space by picking the “middle finger proximal” MANO joint frame with a rotation to match the robot TCP frame (…
Figure 5: Per-task comparison between human cotraining with triangulated hands and robot-only…
Figure 4: Task visualizations. Orange arrow shows representative object motion. Data sources: Following data guidelines from prior work on robot scaling laws [35], we collect 50 demonstrations per environment. For each task, we have 10 training environments, resulting in a total of 500 demos per task. For approaches where we cotrain with human data in addition to the robot data, we always use RGB images sampled from…
Figure 6: HC rollouts on unseen environments. Other ablation experiments: Scale-aligning the human fisheye images with an extent-matched pinhole camera doubles performance over Robot Only (…
Figure 7: Top row are ego-view projected keypoints from our 3D hands, and bottom row are corre…
Figure 8: TriHands triangulation visualizations (first five scenes).
Figure 9: TriHands triangulation visualizations (remaining four scenes).
Figure 10: Comparison of robot and human action marginals in XYZ camera coordinates across all…
Figure 11: Training environments across first three tasks (…
Figure 12: Training environments across first three tasks (…
Figure 13: Training environments across last three tasks (…
Figure 14: Training environments across last three tasks (…
Figure 15: Test environments across first three tasks (…
Figure 16: Test environments across last three tasks (…
Figure 17: Tablecloths used during training/testing (labelled 1–12). 1–4, 6–10 were used in training…
Figure 18: Spatulas used during training/testing (labelled 1–12). 1–10 were used in training and…
Figure 19: Mugs used during training/testing (labelled 1–13). 1–10 were used in training for both…
Figure 20: Bowls used during training/testing (labelled 1–10). Pairs…
Figure 21: Books used during training/testing (labelled 1–12). 1–10 were used in training and 11–12…
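
The Figure 3 caption describes mapping the human hand pose into the robot action space by taking a MANO joint frame and rotating it to match the robot TCP frame. A minimal sketch of that composition is given below, assuming 4x4 homogeneous transforms. The helper names and the 90-degree offset are hypothetical; the actual rotation used in the paper is not given on this page.

```python
import numpy as np


def rotation_z(angle_rad):
    """4x4 homogeneous transform for a pure rotation about the z-axis."""
    c, s = np.cos(angle_rad), np.sin(angle_rad)
    T = np.eye(4)
    T[:2, :2] = [[c, -s], [s, c]]
    return T


def hand_joint_to_tcp(T_cam_joint, T_joint_tcp):
    """Express a TCP-like frame in camera coordinates.

    T_cam_joint: pose of the chosen hand joint frame in the camera frame.
    T_joint_tcp: fixed offset from the joint frame to the TCP convention
                 (a pure rotation here, so the position is unchanged).
    """
    return T_cam_joint @ T_joint_tcp


# Hypothetical joint pose: identity orientation, 0.5 m in front of the camera.
T_cam_joint = np.eye(4)
T_cam_joint[:3, 3] = [0.1, 0.2, 0.5]

# Hypothetical fixed offset; the real alignment rotation is an unknown here.
T_cam_tcp = hand_joint_to_tcp(T_cam_joint, rotation_z(np.pi / 2))
```

Right-multiplying applies the offset in the joint's own frame, so the same fixed correction holds for every hand pose in a video.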
Original abstract

Human video datasets used for cotraining robot manipulation policies largely consist of curated demonstrations where motions are orchestrated to resemble robot behavior and 3D hand poses are captured with specialized hardware. A more plentiful source of data is everyday Internet video, but it is an open question what factors enable transfer from such videos to robots. We investigate this using a new dataset of 532 human videos with 28 hours of high-quality triangulated hand labels and natural motions. We find that hand pose quality affects transfer, but even with accurate hands, the inherent motion gap hinders transfer unless the vision and policy networks specialize to each embodiment. Our cotraining recipe yields consistent improvements, with an absolute success rate gain of $29.7\%$ in the low-robot-data regime across six manipulation tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The paper introduces a dataset of 532 everyday human videos (28 hours) with high-quality triangulated 3D hand labels and investigates factors for cotraining robot manipulation policies. It reports that hand-pose quality matters for transfer but that the motion gap between human and robot embodiments is only overcome when vision and policy networks are specialized per embodiment; their cotraining recipe then produces a 29.7% absolute success-rate gain in the low-robot-data regime across six manipulation tasks.

Significance. If the controlled ablations hold, the work supplies concrete, actionable guidance on using abundant uncurated human video for robot learning and releases a new labeled dataset that can serve as a benchmark. The emphasis on embodiment specialization and the quantitative gains in the low-data regime are directly useful to the robot-manipulation community.

minor comments (3)
  1. [Abstract] Abstract: the headline 29.7% gain is stated without reference to the number of trials, error bars, or the precise low-robot-data baseline, making the central empirical claim harder to evaluate at a glance.
  2. [Dataset / Experiments] §4 (or wherever the dataset is introduced): the paper should explicitly state the total number of robot demonstrations used in the low-data regime and the exact train/validation/test splits for the six tasks so that the 29.7% figure can be reproduced.
  3. [Results] Figure captions and tables reporting success rates should include the number of evaluation episodes and standard deviations; this is especially important given the claim of consistent improvements.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive assessment of our work and the recommendation for minor revision. The summary accurately captures the contributions of the dataset and the key finding on embodiment specialization for cotraining.

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper reports empirical results from a new dataset of 532 everyday human videos with triangulated hand labels, plus controlled ablations across six tasks. Central claims (hand-pose quality matters, embodiment specialization is required to bridge motion gap, cotraining yields 29.7% absolute gain in low-robot-data regime) are direct measurements on fresh data rather than quantities defined by prior fits, self-citations, or ansatzes. No derivation chain reduces to its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that everyday human videos contain transferable signals once embodiment specialization is applied, plus the new dataset's hand labels being sufficiently accurate. No free parameters or invented entities are described in the abstract.

axioms (1)
  • domain assumption: Everyday Internet videos of human actions contain useful signals for robot manipulation policies despite natural motion differences.
    The investigation of transfer factors presupposes this premise to motivate the study of specialization.

pith-pipeline@v0.9.1-grok · 5678 in / 1188 out tokens · 40724 ms · 2026-06-28T01:01:32.386535+00:00 · methodology

Reference graph

Works this paper leans on

42 extracted references · 12 canonical work pages · 10 internal anchors

  [1] J. Shi, Z. Zhao, T. Wang, I. Pedroza, A. Luo, J. Wang, J. Ma, and D. Jayaraman. Zeromimic: Distilling robotic manipulation skills from web videos. In IEEE International Conference on Robotics and Automation, ICRA, 2025.

  [2] H. Bharadhwaj, A. Gupta, V. Kumar, and S. Tulsiani. Towards generalizable zero-shot manipulation via translating human interaction plans. In IEEE International Conference on Robotics and Automation, ICRA, 2024.

  [3] H. Bharadhwaj, R. Mottaghi, A. Gupta, and S. Tulsiani. Track2act: Predicting point tracks from internet videos enables generalizable robot manipulation. In Proceedings of the European Conference on Computer Vision (ECCV), 2024.

  [4] S. Kareer, D. Patel, R. Punamiya, P. Mathur, S. Cheng, C. Wang, J. Hoffman, and D. Xu. Egomimic: Scaling imitation learning via egocentric video. In IEEE International Conference on Robotics and Automation, ICRA, 2025.

  [5] R. Punamiya, D. Patel, P. Aphiwetsa, P. Kuppili, L. Y. Zhu, S. Kareer, J. Hoffman, and D. Xu. Egobridge: Domain adaptation for generalizable imitation from egocentric human data. In Human to Robot: Workshop on Sensorizing, Modeling, and Learning from Humans, 2025.

  [6] K. Grauman, A. Westbury, L. Torresani, K. Kitani, J. Malik, T. Afouras, K. Ashutosh, V. Baiyya, S. Bansal, B. Boote, et al. Ego-exo4d: Understanding skilled human activity from first- and third-person perspectives. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19383–19400, 2024.

  [7] J. Zhang, J. Deng, C. Ma, and R. A. Potamias. Hawor: World-space hand motion reconstruction from egocentric videos. arXiv preprint arXiv:2501.02973, 2025.

  [8] R. Hoque, P. Huang, D. J. Yoon, M. Sivapurapu, and J. Zhang. Egodex: Learning dexterous manipulation from large-scale egocentric video. arXiv preprint arXiv:2505.11709, 2025.

  [9] K. Black, N. Brown, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, L. Groom, K. Hausman, B. Ichter, et al. $\pi_0$: A vision-language-action flow model for general robot control. CoRR, abs/2410.24164, 2024. doi: 10.48550/arXiv.2410.24164.

  [10] Z. Fan, O. Taheri, D. Tzionas, M. Kocabas, M. Kaufmann, M. J. Black, and O. Hilliges. Arctic: A dataset for dexterous bimanual hand-object manipulation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12943–12954, 2023.

  [11] X. Wang, T. Kwon, M. Rad, B. Pan, I. Chakraborty, S. Andrist, D. Bohus, A. Feniello, B. Tekin, F. V. Frujeri, N. Joshi, and M. Pollefeys. Holoassist: an egocentric human interaction dataset for interactive AI assistants in the real world. In IEEE/CVF International Conference on Computer Vision, ICCV, 2023.

  [12] D. Damen, H. Doughty, G. M. Farinella, S. Fidler, A. Furnari, E. Kazakos, D. Moltisanti, J. Munro, T. Perrett, W. Price, et al. Scaling egocentric vision: The epic-kitchens dataset. In Proceedings of the European Conference on Computer Vision (ECCV), pages 720–736, 2018.

  [13] K. Grauman, A. Westbury, E. Byrne, Z. Chavis, A. Furnari, R. Girdhar, J. Hamburger, H. Jiang, M. Liu, X. Liu, et al. Ego4d: Around the world in 3,000 hours of egocentric video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18995–19012, 2022.

  [14] M. K. Srirama, S. Dasari, S. Bahl, and A. Gupta. HRP: human affordances for robotic pre-training. In Robotics: Science and Systems (RSS), 2024.

  [15] A. Kannan, K. Shaw, S. Bahl, P. Mannam, and D. Pathak. DEFT: dexterous fine-tuning for hand policies. In Conference on Robot Learning (CoRL), 2023.

  [16] R. Mendonca, S. Bahl, and D. Pathak. Structured world models from human videos. In Robotics: Science and Systems (RSS), 2023.

  [17] S. Bahl, R. Mendonca, L. Chen, U. Jain, and D. Pathak. Affordances from human videos as a versatile representation for robotics. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR, 2023.

  [18] M. Goyal, S. Modi, R. Goyal, and S. Gupta. Human hands as probes for interactive object understanding. In Computer Vision and Pattern Recognition (CVPR), 2022.

  [19] M. Chang, A. Prakash, and S. Gupta. Look ma, no hands! agent-environment factorization of egocentric videos. In Advances in Neural Information Processing Systems (NeurIPS), 2023.

  [20] C. Wen, X. Lin, J. I. R. So, K. Chen, Q. Dou, Y. Gao, and P. Abbeel. Any-point trajectory modeling for policy learning. In Robotics: Science and Systems (RSS), 2024.

  [21] R. Yang, Q. Yu, Y. Wu, R. Yan, B. Li, A.-C. Cheng, X. Zou, Y. Fang, H. Yin, S. Liu, et al. Egovla: Learning vision-language-action models from egocentric human videos. arXiv preprint arXiv:2507.12440, 2025.

  [22] T. Tao, M. K. Srirama, J. J. Liu, K. Shaw, and D. Pathak. Dexwild: Dexterous human interactions for in-the-wild robot policies. arXiv:2505.07813, 2025.

  [23] R.-Z. Qiu, S. Yang, X. Cheng, C. Chawla, J. Li, T. He, G. Yan, D. J. Yoon, R. Hoque, L. Paulsen, et al. Humanoid policy ~ human policy. arXiv preprint arXiv:2503.13441, 2025.

  [24] A. Sengupta, F. Jin, R. Zhang, and S. Cao. mm-pose: Real-time human skeletal posture estimation using mmwave radars and cnns. IEEE Sensors Journal, 20(17):10032–10044, 2020.

  [25] J. Romero, D. Tzionas, and M. J. Black. Embodied hands: Modeling and capturing hands and bodies together. ACM Transactions on Graphics (Proc. SIGGRAPH Asia), 36(6), Nov. 2017.

  [26] R. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, Cambridge, UK, 2nd edition, 2003.

  [27] R. Tedrake. Robotic Manipulation. 2024. URL http://manipulation.mit.edu

  [28] J. Kannala and S. Brandt. A generic camera calibration method for fish-eye lenses. In Proceedings of the 17th International Conference on Pattern Recognition, 2004. ICPR 2004., volume 1, pages 10–13. IEEE, 2004.

  [29] R. Szeliski. Computer vision: algorithms and applications. Springer Nature, 2022.

  [30] C. Chi, Z. Xu, S. Feng, E. Cousineau, Y. Du, B. Burchfiel, R. Tedrake, and S. Song. Diffusion policy: Visuomotor policy learning via action diffusion. The International Journal of Robotics Research, page 02783649241273668, 2023.

  [31] Z. Teed and J. Deng. Droid-slam: Deep visual slam for monocular, stereo, and rgb-d cameras. Advances in Neural Information Processing Systems, 34:16558–16569, 2021.

  [32] D. Maggio, H. Lim, and L. Carlone. Vggt-slam: Dense rgb slam optimized on the SL(4) manifold. arXiv preprint arXiv:2505.12549, 2025.

  [33] C. Zhou, L. Yu, A. Babu, K. Tirumala, M. Yasunaga, L. Shamis, J. Kahn, X. Ma, L. Zettlemoyer, and O. Levy. Transfusion: Predict the next token and diffuse images with one multi-modal model. arXiv preprint arXiv:2408.11039, 2024.

  [34] A. Dosovitskiy. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.

  [35] Y. Hu, F. Lin, P. Sheng, C. Wen, J. You, and Y. Gao. Data scaling laws in imitation learning for robotic manipulation. arXiv preprint arXiv:2410.18647, 2024.

  [36] J. Barreiros, A. Beaulieu, A. Bhat, R. Cory, E. Cousineau, H. Dai, C.-H. Fang, K. Hashimoto, M. Z. Irshad, M. Itkina, et al. A careful examination of large behavior models for multitask dexterous manipulation. arXiv preprint arXiv:2507.05331, 2025.

  [37] P. Intelligence, A. Amin, R. Aniceto, A. Balakrishna, K. Black, K. Conley, G. Connors, J. Darpinian, K. Dhabalia, J. DiCarlo, et al. $\pi^{*}_{0.6}$: A VLA that learns from experience. arXiv preprint arXiv:2511.14759, 2025.

  [38] R. A. Potamias, J. Zhang, J. Deng, and S. Zafeiriou. Wilor: End-to-end 3d hand localization and reconstruction in-the-wild. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 12242–12254, 2025.

  [39] T. Cheng, D. Shan, A. Hassen, R. Higgins, and D. Fouhey. Towards a richer 2d understanding of hands at scale. Advances in Neural Information Processing Systems, 36:30453–30465, 2023.

    A Hand label comparison. Ego-Exo4D provides 2D keypoint labels but does not release its fine-tuned MMPose model. Therefore, to demonstrate that our 3D-projected keypoints ...

  [40] Most modern 3D hand pose estimators begin reconstruction from a cropped image of just the hand(s). We found the default Ultralytics YOLO v11 bounding box estimator included ...

    Algorithm 1: Multiview Hand Triangulation Pipeline. Require: Multi-view video $V = \{V_{\text{ego}}, V_{\text{exo}_1}, \dots, V_{\text{exo}_N}\}$, intrinsics $\{K_c\}$, extrinsics $\{T_c\}$ for each camera $c$. Ensure: 3D keypoints $J_{3D}$, ...

  [41] Due to our swapping of the bounding box estimator, and due to different padding conventions for Ultralytics YOLO and Hands23, the cropped images going into WiLoR became out-of-distribution, and 3D hand reconstructions were of poor quality. We addressed this by training a small bounding box translation model that took the chirality, bounding box extent...

  [42] Our triangulation pipeline produces triangulated 3D joints. To convert this into MANO parameters, we run GPU-batched inverse kinematics through the MANO model. Due to nonconvexity, this optimization often fails from random initialization - we initialize the optimization by using MANO parameter prediction from HaWoR [7], which has relatively accurately met...
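
The excerpts above mention triangulating 3D hand joints from multi-view video with known intrinsics and extrinsics. As a rough illustration of that step, here is a minimal linear (DLT) triangulation sketch in the style of Hartley and Zisserman [26]. It assumes undistorted pinhole projections and made-up camera matrices; it is not the paper's Algorithm 1, whose full steps are truncated on this page.

```python
import numpy as np


def project(P, X):
    """Project a 3D point X (length 3) with a 3x4 camera matrix P to pixel (u, v)."""
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]


def triangulate_dlt(projections, pixels):
    """Linear triangulation of one point from two or more views.

    projections: list of 3x4 matrices P_c = K_c [R_c | t_c].
    pixels: list of (u, v) observations of the same keypoint, one per view.
    """
    rows = []
    for P, (u, v) in zip(projections, pixels):
        # Each view contributes two linear constraints on the homogeneous point.
        rows.append(u * P[2] - P[0])
        rows.append(v * P[2] - P[1])
    A = np.stack(rows)
    # The solution is the right singular vector with the smallest singular value.
    _, _, vt = np.linalg.svd(A)
    X_h = vt[-1]
    return X_h[:3] / X_h[3]


# Hypothetical two-camera rig (values chosen for illustration only).
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
P_a = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P_b = K @ np.hstack([np.eye(3), np.array([[-0.5], [0.0], [0.0]])])

joint = np.array([0.1, -0.2, 2.0])
recovered = triangulate_dlt([P_a, P_b], [project(P_a, joint), project(P_b, joint)])
```

With noisy 2D keypoints the linear estimate is usually refined by minimizing reprojection error, and fisheye views would need undistortion first; both steps are omitted here.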