Pith · machine review for the scientific record

arXiv:2605.03637 · v1 · submitted 2026-05-05 · 💻 cs.RO


Bridging the Embodiment Gap: Disentangled Cross-Embodiment Video Editing


Pith reviewed 2026-05-07 15:40 UTC · model grok-4.3

classification 💻 cs.RO
keywords cross-embodiment video editing · disentangled representations · video diffusion models · robot learning from human videos · contrastive learning · embodiment gap · generative models for robotics · latent factorization

The pith

By factorizing videos into independent task and embodiment latents, a new editing method converts a single human demonstration into a coherent robot execution video without any paired cross-embodiment data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to solve the distribution shift that prevents direct use of human videos for robot manipulation learning. It argues that task content and embodiment kinematics can be separated into two orthogonal latent spaces by training with a dual contrastive objective that minimizes mutual information between the spaces while maximizing consistency inside each space. These separated codes are then supplied to a frozen video diffusion model through a lightweight adapter, allowing the model to synthesize new videos in which the task is preserved but the body is replaced by a robot. If the separation holds, robots could learn from the large existing collections of human demonstration videos on the internet instead of requiring matched robot recordings for every task.
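As a reading aid, here is a minimal sketch of the interface that pipeline implies, in PyTorch. Everything below is assumed for illustration: the encoder and adapter architectures are toy stand-ins, and only the input/output contract follows the paper's description (per Figure 1, the task encoder consumes text, hand motion, and object trajectory, while the embodiment encoder sees a static image of the end-effector).

```python
import torch
import torch.nn as nn

D = 256  # assumed latent width; the paper does not pin down dimensions here

class MLP(nn.Sequential):
    """Toy stand-in encoder; the paper's actual encoders are richer."""
    def __init__(self, d_in: int, d_out: int):
        super().__init__(nn.Linear(d_in, 512), nn.GELU(), nn.Linear(512, d_out))

task_encoder = MLP(3 * D, D)    # consumes text, hand-motion, object-trajectory feats
embodiment_encoder = MLP(D, D)  # consumes a static end-effector image feature
adapter = MLP(2 * D, D)         # the lightweight trainable piece

# Toy tensors standing in for precomputed input features.
text, hand, traj = (torch.randn(1, D) for _ in range(3))
ee_image_feat = torch.randn(1, D)

z_task = task_encoder(torch.cat([text, hand, traj], dim=-1))  # embodiment-invariant
z_emb = embodiment_encoder(ee_image_feat)                     # robot-specific
cond = adapter(torch.cat([z_task, z_emb], dim=-1))            # conditioning signal
print(cond.shape)  # torch.Size([1, 256]); fed to the frozen video diffusion model
```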

Core claim

Our method factorizes a demonstration video into two orthogonal latent spaces by enforcing a dual contrastive objective: it minimizes mutual information between the spaces to ensure independence while maximizing intra-space consistency to create stable representations. A parameter-efficient adapter injects these latent codes into a frozen video diffusion model, enabling the synthesis of a coherent robot execution video from a single human demonstration, without requiring paired cross-embodiment data.
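Read literally, the claim suggests a training objective of roughly the following shape. This is a hedged reconstruction, not the paper's verified equation: the loss names follow the Figure 2 caption, while the weighting coefficients and the exact combination are assumptions.

```latex
\mathcal{L} \;=\;
  \underbrace{\hat{I}_{\mathrm{CLUB}}\!\big(z_{\mathrm{task}};\, z_{\mathrm{emb}}\big)}_{\mathcal{L}_{\mathrm{distangle}}}
  \;+\; \lambda_{1}\,\mathcal{L}_{\mathrm{task\text{-}contrast}}
  \;+\; \lambda_{2}\,\mathcal{L}_{\mathrm{emb\text{-}contrast}}
```

Here the first term is an upper-bound estimate of the mutual information between the two latent spaces, and the two contrast terms pull together latents that share a task (respectively an embodiment) while pushing apart the rest; a diffusion reconstruction loss for the adapter would presumably sit alongside these terms.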

What carries the argument

Dual contrastive objective that creates orthogonal task and embodiment latent spaces, combined with a parameter-efficient adapter that injects the codes into a frozen video diffusion model.
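A minimal sketch of the intra-space consistency terms, assuming a standard InfoNCE formulation; the temperature, batch construction, and choice of positive pairs are assumptions rather than details taken from the paper:

```python
import torch
import torch.nn.functional as F

def info_nce(z: torch.Tensor, pos: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """InfoNCE consistency: row i of `z` and row i of `pos` are views of the
    same task (or the same embodiment); every other row acts as a negative."""
    z, pos = F.normalize(z, dim=-1), F.normalize(pos, dim=-1)
    logits = z @ pos.t() / tau                # (B, B) cosine-similarity logits
    labels = torch.arange(z.size(0))          # positives sit on the diagonal
    return F.cross_entropy(logits, labels)

B, D = 8, 256  # assumed batch and latent sizes
z_task_a, z_task_b = torch.randn(B, D), torch.randn(B, D)  # same-task pairs
z_emb_a, z_emb_b = torch.randn(B, D), torch.randn(B, D)    # same-embodiment pairs

loss_task = info_nce(z_task_a, z_task_b)  # intra-task-space consistency
loss_emb = info_nce(z_emb_a, z_emb_b)     # intra-embodiment-space consistency
```

The remaining term, the mutual-information penalty between the two spaces, is estimated with CLUB; a sketch of that estimator follows the load-bearing premise below.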

If this is right

  • The generated videos are temporally consistent and morphologically accurate for the target robot.
  • The method works from only a single human demonstration video.
  • No paired human-robot examples are needed during training or inference.
  • Internet-scale human video collections become usable for robot learning.
  • The approach produces coherent robot demonstrations suitable for downstream imitation learning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same factorization could support video editing across other embodiment changes, such as different robot morphologies or animated characters.
  • Policies trained directly on the edited videos might inherit better generalization than those trained on limited robot data.
  • The separation might reduce the cost of robot data collection by substituting public human videos for many tasks.

Load-bearing premise

That a dual contrastive objective applied to video latents will reliably separate task content from embodiment kinematics without any paired cross-embodiment examples.
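This premise is operationalized, per the paper, with the CLUB mutual-information estimator: a variational approximation qϕ(z_emb | z_task) parameterized by an MLP with three linear layers, GELU activations, and a Tanh output for the log-variance. A minimal sketch under those details, with hidden widths and the two-head layout assumed:

```python
import torch
import torch.nn as nn

class CLUBEstimator(nn.Module):
    """CLUB upper bound on I(z_task; z_emb) via a variational q(z_emb | z_task).
    Layer count, GELU activations, and the Tanh log-variance head follow the
    paper's description; widths and the two-head split are assumed."""
    def __init__(self, d_task: int, d_emb: int, hidden: int = 512):
        super().__init__()
        self.mu = nn.Sequential(
            nn.Linear(d_task, hidden), nn.GELU(),
            nn.Linear(hidden, hidden), nn.GELU(),
            nn.Linear(hidden, d_emb))
        self.logvar = nn.Sequential(
            nn.Linear(d_task, hidden), nn.GELU(),
            nn.Linear(hidden, hidden), nn.GELU(),
            nn.Linear(hidden, d_emb), nn.Tanh())

    def forward(self, z_task: torch.Tensor, z_emb: torch.Tensor) -> torch.Tensor:
        mu, logvar = self.mu(z_task), self.logvar(z_task)
        # Gaussian log-likelihood (up to constants) of matched vs. shuffled pairs.
        pos = (-(z_emb - mu) ** 2 / logvar.exp()).sum(-1)
        shuffled = z_emb[torch.randperm(z_emb.size(0))]
        neg = (-(shuffled - mu) ** 2 / logvar.exp()).sum(-1)
        return (pos - neg).mean() / 2.0  # sampled CLUB upper bound on MI

mi = CLUBEstimator(d_task=256, d_emb=256)
loss_distangle = mi(torch.randn(8, 256), torch.randn(8, 256))
```

In standard CLUB training, qϕ is periodically refit to maximize the likelihood of matched pairs while the encoders minimize the resulting bound; whether that bound stays meaningful on purely unpaired human and robot clips is precisely where this premise could fail.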

What would settle it

A case in which the generated robot video shows incorrect limb proportions or joint trajectories that do not match the target robot morphology while still attempting the demonstrated task.

Figures

Figures reproduced from arXiv: 2605.03637 by Joni Pajarinen, Pekka Marttinen, Wenshuai Zhao, Wenyan Yang, Yuanpeng Tu, Yue Ma, Zhiyuan Li.

Figure 1. Framework Overview. Our framework for cross-embodiment video editing learns to disentangle a demonstration into two orthogonal latent representations. First, a trainable Task Encoder creates an embodiment-invariant Task Embedding from the text description, hand motion, and object trajectory. Simultaneously, an Embodiment Encoder generates an Embodiment Embedding from a static image of the agent's end-effec…
Figure 2. The Dual Contrastive Learning Objective. Our objective structures the latent space by simultaneously minimizing mutual information between the task and embodiment spaces (L_distangle) while maximizing intra-embodiment-space consistency (L_emb-contrast) and intra-task-space consistency (L_task-contrast). […] injection allows the task and embodiment signals to steer the video generation at multiple featu…
Figure 3. Qualitative comparison of cross-embodiment video editing. Given a source egocentric human video (top row) and a target robot end-effector (top left), we compare the synthesized videos from our method against the VACE and Phantom baselines across two manipulation tasks: Grasping and Pouring. The general-purpose VACE model struggles to generate the correct morphology, often inpainting a generic humanoid hand…
Figure 4. Visualization of the disentangled latent spaces learned with our dual contrastive objective. The t-SNE plots (left three) show clear clustering of different tasks and embodiments into distinct groups. The correlation matrix (right) quantitatively confirms the disentanglement, showing high intra-space similarity (red diagonal blocks) and near-zero correlation between task and embodiment representations (blu…
Figure 5. Qualitative comparison of cross-embodiment video editing on the 'grasping a plastic bottle' task.
Figure 6. Qualitative comparison of cross-embodiment video editing on the 'picking up a black box' task.
Figure 7. Qualitative comparison of cross-embodiment video editing on the 'pouring' task.
Figure 8. Visual ablation study of the dual contrastive objective on the 'grasping' task. Removing the dual contrastive objective leads to a clear degradation in generation quality, evidenced by the blurred interaction between the robot hand and the object. Our full model produces a cleaner and more coherent result, confirming that our explicit latent space regularization is critical for synthesizing high-fidelity i…
Figure 9. Visual ablation study of the dual contrastive objective on the 'picking' task.
Figure 10. T-SNE ablation study of the disentangle objective. (Panels: Task Space with pouring, grasping, and picking clusters; Embodiment Space with human and robotic clusters; Combined Space with task and embodiment points.)
Figure 11. T-SNE ablation study of the task contrast objective. (Same three-panel layout: Task Space, Embodiment Space, Combined Space.)
Figure 12. T-SNE ablation study of the embodiment contrast objective.
Figure 13. T-SNE ablation study of the dual objective.
Original abstract

Learning robotic manipulation from human videos is a promising solution to the data bottleneck in robotics, but the distribution shift between humans and robots remains a critical challenge. Existing approaches often produce entangled representations, where task-relevant information is coupled with human-specific kinematics, limiting their adaptability. We propose a generative framework for cross-embodiment video editing that directly addresses this by learning explicitly disentangled task and embodiment representations. Our method factorizes a demonstration video into two orthogonal latent spaces by enforcing a dual contrastive objective: it minimizes mutual information between the spaces to ensure independence while maximizing intra-space consistency to create stable representations. A parameter-efficient adapter injects these latent codes into a frozen video diffusion model, enabling the synthesis of a coherent robot execution video from a single human demonstration, without requiring paired cross-embodiment data. Experiments show our approach generates temporally consistent and morphologically accurate robot demonstrations, offering a scalable solution to leverage internet-scale human video for robot learning.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated author's rebuttal, circularity check, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes a generative framework for cross-embodiment video editing to address the distribution shift between human and robot videos in robotic manipulation learning. It factorizes demonstration videos into two orthogonal latent spaces (task and embodiment) by applying a dual contrastive objective that minimizes mutual information between the spaces for independence while maximizing intra-space consistency. A parameter-efficient adapter then injects these latent codes into a frozen video diffusion model, allowing synthesis of coherent robot execution videos from single human demonstrations without requiring paired cross-embodiment data. The abstract asserts that experiments produce temporally consistent and morphologically accurate robot videos.

Significance. If the disentanglement and editing pipeline hold, the work could be significant for robotics by enabling scalable use of internet-scale human videos for robot learning without paired data or large robot datasets. The parameter-efficient adapter and frozen diffusion model approach is practically attractive for efficiency. However, the absence of any quantitative validation in the provided description limits assessment of whether these benefits are realized.

major comments (2)
  1. [Abstract] The central claim that 'experiments show our approach generates temporally consistent and morphologically accurate robot demonstrations' is unsupported by any metrics (e.g., temporal consistency scores, morphological error measures), baselines, ablations, or implementation details. Without this evidence, the soundness of the core contribution cannot be evaluated.
  2. [Method] Dual contrastive objective: The factorization into task and embodiment spaces relies solely on minimizing mutual information between latents and maximizing intra-space consistency on unpaired human/robot videos. No explicit cross-embodiment invariance signal exists (no paired examples of the same task under both embodiments), so it is unclear whether the resulting codes will be invariant to kinematic changes; this directly undermines the guarantee that swapping embodiment codes yields morphologically correct robot motion.
minor comments (1)
  1. [Abstract and Method] The abstract and method overview would benefit from explicit equations for the dual contrastive losses and the adapter injection mechanism to clarify the implementation.
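On that request, the losses are sketched earlier in this review; for the adapter injection, one plausible parameter-efficient scheme consistent with the description (signals that "steer the video generation at multiple feature levels") is a zero-initialized residual projection per frozen block. This is an assumed pattern, not the paper's confirmed design:

```python
import torch
import torch.nn as nn

class LatentAdapter(nn.Module):
    """Zero-initialized residual injection into one frozen block: training
    starts from the unmodified diffusion model, and only this projection
    (one per injected feature level) receives gradients."""
    def __init__(self, d_latent: int, d_hidden: int):
        super().__init__()
        self.proj = nn.Linear(2 * d_latent, d_hidden)
        nn.init.zeros_(self.proj.weight)  # no effect at initialization
        nn.init.zeros_(self.proj.bias)

    def forward(self, h: torch.Tensor, z_task: torch.Tensor,
                z_emb: torch.Tensor) -> torch.Tensor:
        cond = self.proj(torch.cat([z_task, z_emb], dim=-1))  # (B, d_hidden)
        return h + cond.unsqueeze(1)  # broadcast over the block's tokens

h = torch.randn(2, 77, 1024)  # hidden states of a frozen transformer block
adapter = LatentAdapter(d_latent=256, d_hidden=1024)
h = adapter(h, torch.randn(2, 256), torch.randn(2, 256))
```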

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our work. We address the two major comments point by point below, outlining the revisions we will implement to improve clarity and evidence.

Point-by-point responses
  1. Referee: [Abstract] The central claim that 'experiments show our approach generates temporally consistent and morphologically accurate robot demonstrations' is unsupported by any metrics (e.g., temporal consistency scores, morphological error measures), baselines, ablations, or implementation details. Without this evidence, the soundness of the core contribution cannot be evaluated.

    Authors: We agree that the abstract claim requires explicit quantitative backing for proper evaluation. The full manuscript presents qualitative video results and initial consistency checks, but to directly address this, we will revise the abstract to reference specific metrics (e.g., optical-flow-based temporal consistency and keypoint-based morphological error; a minimal sketch of such metrics follows these responses) and add a dedicated experiments subsection with baselines, ablations, and implementation details. These changes will be incorporated in the revised version. revision: yes

  2. Referee: [Method] Dual contrastive objective: The factorization into task and embodiment spaces relies solely on minimizing mutual information between latents and maximizing intra-space consistency on unpaired human/robot videos. No explicit cross-embodiment invariance signal exists (no paired examples of the same task under both embodiments), so it is unclear whether the resulting codes will be invariant to kinematic changes; this directly undermines the guarantee that swapping embodiment codes yields morphologically correct robot motion.

    Authors: The dual contrastive objective is intended to induce the required invariance implicitly from unpaired data. Minimizing mutual information between the two latent spaces across mixed human and robot videos encourages the task latent to discard embodiment-specific kinematics, while maximizing intra-task consistency pulls representations of the same task together regardless of embodiment. This contrastive signal, applied over diverse unpaired examples, enables the subsequent swapping to produce morphologically accurate outputs. We will expand the method section with additional intuition, a worked example of the loss terms, and supporting ablation analysis to make this mechanism explicit. revision: partial
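For concreteness on the metrics named in response 1, here is a minimal sketch of what such an evaluation could compute. Both functions are illustrative simplifications, not the paper's protocol: a genuinely flow-based temporal consistency metric would first warp frame t toward t+1 with an estimated optical flow field before differencing.

```python
import torch

def morphological_error(pred_kpts: torch.Tensor, ref_kpts: torch.Tensor) -> torch.Tensor:
    """Mean per-keypoint L2 error between generated and reference robot
    keypoints, each of shape (T, K, 2): a proxy for morphological accuracy."""
    return (pred_kpts - ref_kpts).norm(dim=-1).mean()

def temporal_consistency(frames: torch.Tensor) -> torch.Tensor:
    """Crude stand-in over a clip of shape (T, C, H, W): mean photometric
    difference between adjacent frames; lower suggests smoother generation."""
    return (frames[1:] - frames[:-1]).abs().mean()

video = torch.rand(16, 3, 64, 64)  # toy generated clip
print(temporal_consistency(video))
print(morphological_error(torch.rand(16, 8, 2), torch.rand(16, 8, 2)))
```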

Circularity Check

0 steps flagged

No significant circularity in the derivation chain

Full rationale

The paper presents a generative framework that factorizes videos into task and embodiment latents via an explicitly designed dual contrastive objective (MI minimization plus intra-space consistency maximization) followed by adapter-based injection into a frozen diffusion model. This is a procedural definition of the method rather than a derivation in which any claimed result or prediction reduces by construction to its own inputs, a fitted parameter, or a self-citation chain. No load-bearing self-citations, uniqueness theorems, or renamings of known results appear in the provided text. The central claims rest on the empirical behavior of the proposed losses on unpaired data, which is an independent modeling choice and not tautological.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that task and embodiment factors are separable via mutual-information minimization in latent space and that a small adapter can faithfully recombine them inside a frozen diffusion model.

free parameters (1)
  • adapter parameters
    The parameter-efficient adapter is trained to map the disentangled latents into the diffusion model; its weights are fitted to data.
axioms (1)
  • domain assumption: Task-relevant information and embodiment-specific kinematics can be represented as independent orthogonal latent factors.
    This independence is enforced by the dual contrastive objective that minimizes mutual information between the two spaces; a quick empirical check is sketched below.
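That check can be as simple as the cross-space correlation block that Figure 4 visualizes, whose entries should sit near zero if the axiom holds. A minimal sketch, with batch and latent dimensions assumed:

```python
import torch

def cross_space_correlation(z_task: torch.Tensor, z_emb: torch.Tensor) -> torch.Tensor:
    """Task-versus-embodiment block of the latent correlation matrix;
    near-zero entries here are the signature Figure 4 reports."""
    z = torch.cat([z_task, z_emb], dim=-1)        # (B, d_task + d_emb)
    z = (z - z.mean(0)) / (z.std(0) + 1e-8)       # standardize each dimension
    corr = z.t() @ z / z.size(0)                  # full correlation matrix
    d = z_task.size(-1)
    return corr[:d, d:]                           # off-diagonal block

block = cross_space_correlation(torch.randn(128, 64), torch.randn(128, 64))
print(block.abs().mean())  # should stay near zero if the axiom holds
```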


