
arxiv: 2606.28237 · v1 · pith:B3TXJYMNnew · submitted 2026-06-26 · 💻 cs.RO

Unleashing Infinite Motion: Scaling Expressive Quadrupedal Motion via Generative Video Priors

Pith reviewed 2026-06-29 04:07 UTC · model grok-4.3

classification 💻 cs.RO
keywords quadrupedal locomotion · motion generation · video diffusion models · generative priors · tracking policies · robot learning · dataset creation

The pith

Uni-Mo generates expressive quadruped motions from text prompts using video diffusion models and lifts them into deployable 3D trajectories.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to expand quadruped robot behaviors beyond a handful of gaits by treating data scarcity as a generation task rather than a capture problem. An LLM creates motion prompts, a video diffusion model produces the corresponding robot videos, and an Identity Consistency Loss keeps the robot's appearance stable so that the videos can be lifted into 3D reference trajectories. These trajectories then train tracking policies that run on a real Unitree Go2. The pipeline yields an open dataset of 7,488 motions and reports a 96.7 percent hardware success rate on 392 randomly sampled motions.

Core claim

The paper reframes quadruped motion synthesis as a video generation problem: an LLM proposes motion prompts, a video diffusion model synthesizes the behaviors, and the generated videos are lifted into 3D reference trajectories used to train tracking policies that deploy on physical hardware without any animal data in the loop.

What carries the argument

The Uni-Mo pipeline, which chains LLM-generated prompts, video diffusion synthesis, and an Identity Consistency Loss to produce coherent videos that lift reliably into 3D motion references.
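
The chain can be pictured as stages composed in order. The sketch below is illustrative only: every name, type, and stub body is a hypothetical placeholder rather than the authors' code or any real library's API, and the filter stage with its joint-angle bound is an assumption added to show where a quality gate could sit. It shows the data flow (prompt, video, 3D reference) and nothing more.

```python
from dataclasses import dataclass
from typing import Callable, List

# All names here are hypothetical; they mirror the stage order only.

@dataclass
class Video:
    prompt: str
    frames: List[str]                 # stand-in for image frames

@dataclass
class Trajectory:
    prompt: str
    joint_angles: List[List[float]]   # one row of joint angles per frame

def run_pipeline(
    propose_prompts: Callable[[int], List[str]],
    synthesize_video: Callable[[str], Video],
    lift_to_3d: Callable[[Video], Trajectory],
    passes_filter: Callable[[Trajectory], bool],
    n_prompts: int,
) -> List[Trajectory]:
    """Chain the stages and keep only references that pass the filter."""
    references = []
    for prompt in propose_prompts(n_prompts):
        video = synthesize_video(prompt)
        trajectory = lift_to_3d(video)
        if passes_filter(trajectory):
            references.append(trajectory)
    return references

# Toy stand-ins so the sketch runs end to end.
def toy_prompts(n: int) -> List[str]:
    return [f"motion {i}" for i in range(n)]

def toy_video(prompt: str) -> Video:
    return Video(prompt, frames=[f"{prompt}/frame{t}" for t in range(3)])

def toy_lift(video: Video) -> Trajectory:
    # Assumes 12 joint angles per frame (three per leg).
    return Trajectory(video.prompt, [[0.0] * 12 for _ in video.frames])

def toy_filter(trajectory: Trajectory) -> bool:
    # Hypothetical gate: reject references with any angle outside +/-1.5 rad.
    return all(abs(q) <= 1.5 for row in trajectory.joint_angles for q in row)

refs = run_pipeline(toy_prompts, toy_video, toy_lift, toy_filter, n_prompts=4)
print(len(refs))
```

Passing the stages in as callables keeps the sketch neutral about which LLM, video model, or lifting method fills each slot.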

If this is right

  • The released dataset of 7,488 language-annotated motions spanning 18.5 hours supports training of many acrobatic and performative behaviors.
  • Tracking policies achieve a 96.7 percent success rate on 392 randomly sampled motions deployed on real hardware.
  • A 97.6 percent success rate holds across the full dataset when tested in simulation.
  • Expressive motions beyond standard gaits become feasible without animal capture or retargeting steps.
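
The headline figures above imply a few quantities that the summary does not state directly. The short calculation below derives them; the success counts are inferred by rounding the reported percentages and are not numbers reported by the paper.

```python
# Figures quoted in the summary above.
n_motions = 7488                  # motions in the released dataset
hours = 18.5                      # total dataset duration
hw_sampled, hw_rate = 392, 0.967  # real-robot sample size and success rate
sim_rate = 0.976                  # simulation success rate over the full dataset

# Derived quantities (inferred, not reported).
avg_seconds = hours * 3600 / n_motions
hw_successes = round(hw_rate * hw_sampled)
sim_successes = round(sim_rate * n_motions)

print(f"average clip length: {avg_seconds:.1f} s")
print(f"implied hardware successes: {hw_successes} of {hw_sampled}")
print(f"implied simulation successes: {sim_successes} of {n_motions}")
```

Under these roundings the average motion lasts about 8.9 seconds, and the rates correspond to roughly 379 of 392 hardware trials and 7,308 of 7,488 simulated motions.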

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar prompt-and-lift pipelines could be adapted to other robot morphologies by changing only the video generation prompts.
  • The open dataset may serve as a starting point for combining generated motions with small amounts of real data to improve robustness.
  • The same generation approach might extend to tasks involving manipulation or multi-robot coordination once suitable video priors exist.

Load-bearing premise

Videos from the diffusion model stay consistent in robot appearance across frames so that accurate 3D trajectories can be extracted and used to train policies that work on real hardware.
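
One way to picture "consistent appearance across frames" is as a score over per-frame feature vectors. The sketch below is an assumption-laden illustration and not the paper's Identity Consistency Loss, whose formula is not given in this summary. It takes one embedding per frame, compares each to a reference embedding by cosine similarity, and averages 1 minus that similarity. The function names and toy vectors are hypothetical.

```python
import math
from typing import List, Sequence

def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

def appearance_drift(frame_feats: List[Sequence[float]],
                     reference: Sequence[float]) -> float:
    """Mean (1 - cosine similarity) between each frame and the reference."""
    return sum(1.0 - cosine(f, reference) for f in frame_feats) / len(frame_feats)

# Toy 2-D "embeddings": one clip stays put, the other rotates away.
reference = [1.0, 0.0]
steady = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
drifting = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]

print(appearance_drift(steady, reference))
print(appearance_drift(drifting, reference))
```

A score near zero means every frame resembles the reference; the drifting clip scores higher because its later frames point away from it.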

What would settle it

If policies trained on the lifted trajectories show deployment success rates well below 90 percent on the physical Unitree Go2, the claim that the generated videos yield usable references would not hold.
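
To judge how far the reported hardware rate sits from the 90 percent mark, a confidence interval helps. The sketch below computes a Wilson score interval under two assumptions that are not in the source: that 96.7 percent of 392 corresponds to 379 successes, and that the trials can be treated as independent.

```python
import math
from typing import Tuple

def wilson_interval(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives about 95%)."""
    p = successes / trials
    denom = 1 + z * z / trials
    centre = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return centre - half, centre + half

# 379 is inferred from 96.7% of 392; it is not a count reported in the paper.
low, high = wilson_interval(379, 392)
print(f"95% interval: [{low:.3f}, {high:.3f}]")
print("lower bound above 0.90:", low > 0.90)
```

Under those assumptions the interval runs from roughly 0.94 to 0.98, so the reported rate is comfortably above the 90 percent threshold named here.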

Figures

Figures reproduced from arXiv: 2606.28237 by Li Gao, Liu Liu, Yang Cai, Yifei Qian, Youzhi Liu, Ziqiao Li.

Figure 1: Animal-to-robot retargeting is ill-posed due to morphological mismatch (left); Uni-Mo…
Figure 2: The Uni-Mo pipeline: natural-language prompts drive an identity-consistent video diffu…
Figure 3: Quad-Imaginarium dominates both motion-capture-derived baselines on the majority of axes (a), with histograms shifted toward higher values and heavier tails (b) and dense joint-angle coverage across the full [−1.5, 1.5] rad range (c). Multi-stage quality filtering. The formulation above makes recovery tractable but does not guarantee success on every clip. We apply three sequential gates to discard unfaith…
Figure 4: Qualitative comparison. Wan-Base shows severe body melting; Wan-FT still has occa…
Figure 5: Representative real-robot executions of expressive motions from Quad-Imaginarium.
Figure 6: The N=20 appearance bank frames. As shown in…
Figure 7: DINOv2 CLS token attention maps on training frames. Top: original images. Middle:…
Figure 8: Additional generated video frame sequences from Wan-FT +…
Figure 9: Real-robot executions of the six motions from Figure…
Original abstract

Quadruped robots have achieved remarkable locomotion, yet their behavioral repertoire remains confined to a few gaits--far from the expressive, companion-like presence long envisioned for them. Attempts to import the humanoid recipe of large-scale motion data have inherited one tacit assumption: that robot motion must first pass through an animal body, making data collection dependent on cooperative animals, reconstruction fragile across species, and retargeting ill-posed across incompatible morphologies. We propose Uni-Mo, a fully automated pipeline that removes the animal from the loop by reframing data scarcity as a generation problem: an LLM proposes motion prompts, a video diffusion model synthesizes the corresponding robot behaviors, and the generated videos are lifted into 3D reference trajectories used to train tracking policies deployed on a real Unitree Go2. To make naively-drifting generations reliably extractable, we introduce an Identity Consistency Loss that enforces appearance coherence across frames. We release Quad-Imaginarium at https://github.com/GaoLii/Quad-Imaginarium.git, the resulting open-source dataset of 7,488 language-annotated quadruped motions (18.5 hours) spanning acrobatic and performative behaviors. We validate 392 randomly sampled motions on a real Unitree Go2 with a 96.7% deployment success rate, complemented by a 97.6% success rate across the full dataset in simulation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The manuscript presents Uni-Mo, a fully automated pipeline for scaling expressive quadrupedal motions without animal data. An LLM generates motion prompts, a video diffusion model with a new Identity Consistency Loss synthesizes robot behaviors, the videos are lifted to 3D reference trajectories, and tracking policies are trained and deployed on a real Unitree Go2. The authors release the Quad-Imaginarium dataset (7,488 language-annotated motions) and report 96.7% success on 392 randomly sampled real-robot deployments plus 97.6% success across the full dataset in simulation.

Significance. If the lifting step produces accurate 3D trajectories, the approach would remove a major bottleneck in quadruped motion generation by replacing animal mocap with generative video priors, enabling broader behavioral repertoires. The open release of the 18.5-hour dataset is a concrete community contribution that supports reproducibility and follow-on work.

major comments (1)
  1. [Abstract and validation experiments] The reported 96.7% real-robot and 97.6% simulation success rates are presented without any quantitative metrics on 3D lifting accuracy (e.g., 3D joint position error, reprojection error, foot-skate, or kinematic consistency against mocap or synthetic ground truth). This is load-bearing for the central claim, because policies could achieve high success rates even with depth-ambiguous or artifact-laden references if trained to be robust to noise.
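
Two of the metrics the referee names can be sketched in a few lines. The code below is a minimal illustration under stated assumptions, not an evaluation protocol from the paper: it assumes ground-truth joint positions and per-frame foot-contact flags are available, and the function names and toy data are hypothetical.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float, float]
Frame = List[Point]   # one 3D point per joint

def mpjpe(pred: List[Frame], truth: List[Frame]) -> float:
    """Mean per-joint position error: average Euclidean distance over frames and joints."""
    total, count = 0.0, 0
    for pred_frame, true_frame in zip(pred, truth):
        for p, t in zip(pred_frame, true_frame):
            total += math.dist(p, t)
            count += 1
    return total / count

def foot_skate(foot_path: List[Point], contact: List[bool]) -> float:
    """Mean horizontal slide per step while the foot is in contact in both frames."""
    slides = []
    for t in range(1, len(foot_path)):
        if contact[t - 1] and contact[t]:
            (x0, y0, _), (x1, y1, _) = foot_path[t - 1], foot_path[t]
            slides.append(math.hypot(x1 - x0, y1 - y0))
    return sum(slides) / len(slides) if slides else 0.0

# Toy data: two frames, two joints, small errors on one joint per frame.
truth = [[(0, 0, 0), (1, 0, 0)], [(0, 0, 0), (1, 0, 0)]]
pred = [[(0, 0, 0.03), (1, 0, 0)], [(0, 0, 0), (1, 0.04, 0)]]

# Toy foot path: slides 1 cm while planted, then lifts off.
path = [(0, 0, 0), (0.01, 0, 0), (0.01, 0, 0), (0.2, 0, 0.05)]
contact = [True, True, True, False]

print(mpjpe(pred, truth))
print(foot_skate(path, contact))
```

Both functions return values in the units of the input coordinates, so a lifted trajectory and its reference must share a frame and scale before they are compared.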

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on the evaluation of the 3D lifting step. We address the major comment below.

Point-by-point responses
  1. Referee: [Abstract and validation experiments] The reported 96.7% real-robot and 97.6% simulation success rates are presented without any quantitative metrics on 3D lifting accuracy (e.g., 3D joint position error, reprojection error, foot-skate, or kinematic consistency against mocap or synthetic ground truth). This is load-bearing for the central claim, because policies could achieve high success rates even with depth-ambiguous or artifact-laden references if trained to be robust to noise.

    Authors: We agree that direct quantitative metrics on 3D lifting accuracy would provide valuable additional evidence and address the concern that high policy success could arise from robustness to noisy references rather than accurate trajectories. The real-robot success rate serves as an end-to-end measure of trajectory usability, but we acknowledge the referee's point that intermediate lifting quality metrics strengthen the central claim. In the revised manuscript we will add such evaluations, including 3D joint position error, reprojection error, foot-skate, and kinematic consistency, computed against synthetic ground truth derived from the video generation process and available kinematic priors.

    revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation relies on external validation

full rationale

The paper describes a pipeline of LLM prompt generation, video diffusion synthesis regularized by an Identity Consistency Loss, 3D lifting, policy training, and deployment validation on a real Unitree Go2 robot (96.7% success on 392 samples) plus simulation (97.6%). No equations, fitted parameters, or self-citations are presented that reduce any claimed result to an internal definition or input by construction. The reported success rates are measured on physical hardware and in simulation, so the central claims rest on external evaluation rather than on tautology. No load-bearing steps match the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The approach rests on the assumption that pre-trained video diffusion models can be adapted to produce robot-specific coherent motion sequences and that 3D lifting from such videos yields usable references; no free parameters or invented physical entities are described in the abstract.

axioms (2)
  • domain assumption: Video diffusion models trained on general video data can synthesize coherent quadruped robot motions when conditioned on text prompts.
    Central to the data generation step; invoked when the abstract states that the diffusion model synthesizes the corresponding robot behaviors.
  • domain assumption: 3D trajectories extracted from generated videos are sufficiently accurate to train deployable tracking policies.
    Required for the transition from video generation to real-robot policy training.

pith-pipeline@v0.9.1-grok · 5793 in / 1428 out tokens · 49479 ms · 2026-06-29T04:07:10.242799+00:00 · methodology


Reference graph

Works this paper leans on

60 extracted references · 13 canonical work pages

  1. X. Cheng, K. Shi, A. Agarwal, and D. Pathak. Extreme parkour with legged robots. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 11443–11450,
  2. doi:10.1109/ICRA57147.2024.10610200
  3. A. Kumar, Z. Fu, D. Pathak, and J. Malik. Rma: Rapid motor adaptation for legged robots. 07
  4. doi:10.15607/RSS.2021.XVII.011
  5. N. Rudin, D. Hoeller, P. Reist, and M. Hutter. Learning to walk in minutes using massively parallel deep reinforcement learning. arXiv preprint arXiv:2109.11978, 2021
  6. T. Haarnoja, B. Moran, G. Lever, S. H. Huang, et al. Learning agile soccer skills for a bipedal robot with deep reinforcement learning. Science Robotics, 9(89), 2024. doi:10.1126/scirobotics.adi8022
  7. T. Li et al. Learning terrain-adaptive locomotion with agile behaviors by imitating animals. In 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 339–345, 2023. doi:10.1109/IROS55552.2023.10342271
  8. R. Yang et al. Generalized animal imitator: Agile locomotion with versatile motion prior. In Conference on Robot Learning, pages 4631–4650, 2023
  9. X. Cheng, Y. Ji, J. Chen, R. Yang, G. Yang, and X. Wang. Expressive whole-body control for humanoid robots, 2024. URL https://arxiv.org/abs/2402.16796
  10. Z. Fu, Q. Zhao, Q. Wu, G. Wetzstein, and C. Finn. Humanplus: Humanoid shadowing and imitation from humans, 2024. URL https://arxiv.org/abs/2406.10454
  11. X. B. Peng, Y. Guo, L. Halper, S. Levine, and S. Fidler. Ase: large-scale reusable adversarial skill embeddings for physically simulated characters. ACM Trans. Graph., 41(4), July
  12. ISSN 0730-0301. doi:10.1145/3528223.3530110. URL https://doi.org/10.1145/3528223.3530110
  13. M. Ji, X. Peng, F. Liu, J. Li, G. Yang, X. Cheng, and X. Wang. Exbody2: Advanced expressive humanoid whole-body control. arXiv preprint arXiv:2412.13196, 2024
  14. T. He, W. Xiao, T. Lin, Z. Luo, Z. Xu, Z. Jiang, J. Kautz, C. Liu, G. Shi, X. Wang, L. Fan, and Y. Zhu. Hover: Versatile neural whole-body controller for humanoid robots. arXiv preprint arXiv:2410.21229, 2024
  15. C. Zhang, W. Xiao, T. He, and G. Shi. Wococo: Learning whole-body humanoid control with sequential contacts. arXiv preprint arXiv:2406.06005, 2024
  16. Z. Zhuang, S. Yao, and H. Zhao. Humanoid parkour learning. arXiv preprint arXiv:2406.10759, 2024
  17. Carnegie Mellon University. CMU MoCap Dataset. URL http://mocap.cs.cmu.edu
  18. N. Mahmood, N. Ghorbani, N. F. Troje, G. Pons-Moll, and M. J. Black. AMASS: Archive of motion capture as surface shapes. In International Conference on Computer Vision, pages 5442–5451, Oct. 2019
  19. X. Chen, B. Jiang, W. Liu, Z. Huang, B. Fu, T. Chen, J. Yu, and G. Yu. Executing your commands via motion diffusion in latent space. arXiv preprint arXiv:2212.04048, 2023
  20. M. Loper, N. Mahmood, J. Romero, G. Pons-Moll, and M. J. Black. Smpl: a skinned multi-person linear model. 34(6), Nov. 2015. ISSN 0730-0301. doi:10.1145/2816795.2818013. URL https://doi.org/10.1145/2816795.2818013
  21. A. Allshire, H. Choi, J. Zhang, D. McAllister, A. Zhang, C. M. Kim, T. Darrell, P. Abbeel, J. Malik, and A. Kanazawa. Visual imitation enables contextual humanoid control. In 9th Conference on Robot Learning (CoRL), 2025
  22. J. Z. Zhang et al. Slomo: A general system for legged robot motion imitation from casual videos. IEEE Robotics and Automation Letters, 8:7154–7161, 2023. doi:10.1109/LRA.2023.3313937
  23. D. Joska, L. Clark, N. Muramatsu, R. Jericevich, F. Nicolls, A. Mathis, M. W. Mathis, and A. Patel. Acinoset: A 3d pose estimation dataset and baseline models for cheetahs in the wild. In 2021 IEEE International Conference on Robotics and Automation (ICRA), pages 13901–13908, 2021. doi:10.1109/ICRA48506.2021.9561338
  24. Z. Wang, S. Chen, L. Mo, X. Gao, Y. Shen, L. Ding, and W. Liang. Dogmo: A large-scale multi-view rgb-d dataset for 4d canine motion recovery, 2025. URL https://arxiv.org/abs/2510.24117
  25. H. Yu, Y. Xu, J. Zhang, W. Zhao, Z. Guan, and D. Tao. Ap-10k: A benchmark for animal pose estimation in the wild. arXiv preprint arXiv:2108.12617, 2021
  26. L. Zhao, Z. Luo, Y. Han, J. Zhang, Y. Chen, Y. Liu, and P. Lu. Learning aggressive animal locomotion skills for quadrupedal robots solely from monocular videos. npj Robotics, 3(1):32, 2025
  27. E. Chane-Sane, C. Roux, O. Stasse, and N. Mansard. Reinforcement learning from wild animal videos, 2024. URL https://arxiv.org/abs/2412.04273
  28. S. Zuffi, A. Kanazawa, D. W. Jacobs, and M. J. Black. 3D Menagerie: Modeling the 3D Shape and Pose of Animals. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5524–5532, Los Alamitos, CA, USA, July 2017. IEEE Computer Society. doi:10.1109/CVPR.2017.586. URL https://doi.ieeecomputersociety.org/10.1109/CVPR.2017.586
  29. Wan Team, A. Wang, B. Ai, B. Wen, C. Mao, et al. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314, 2025
  30. W. Kong, Q. Tian, Z. Zhang, R. Min, Z. Dai, J. Zhou, et al. Hunyuanvideo: A systematic framework for large video generative models. arXiv preprint arXiv:2412.03603, 2024
  31. Team Seedance, D. Chen, L. Chen, X. Chen, Y. Chen, et al. Seedance 2.0: Advancing video generation for world complexity, 2026. URL https://arxiv.org/abs/2604.14148
  32. L. Mou, J. Lei, C. Wang, L. Liu, and K. Daniilidis. Dimo: Diverse 3d motion generation for arbitrary objects, 2025. URL https://arxiv.org/abs/2511.07409
  33. J. Mao, S. He, H.-N. Wu, Y. You, S. Sun, Z. Wang, Y. Bao, H. Chen, L. Guibas, V. Guizilini, H. Zhou, and Y. Wang. Robot learning from a physical world model, 2025. URL https://arxiv.org/abs/2511.07416
  34. K. Ye, Y. Wu, S. Hu, J. Li, M. Liu, Y. Chen, and R. Huang. Gen2real: Towards demo-free dexterous manipulation by harnessing generated video, 2025. URL https://arxiv.org/abs/2509.14178
  35. J. Ni, Z. Wang, W. Lin, A. Bar, Y. LeCun, T. Darrell, J. Malik, and R. Herzig. From generated human videos to physically plausible robot trajectories, 2025. URL https://arxiv.org/abs/2512.05094
  36. S. Wu, F. Teng, H. Shi, Q. Jiang, K. Luo, K. Wang, and K. Yang. Quadreamer: Controllable panoramic video generation for quadruped robots, 2025. URL https://arxiv.org/abs/2508.02512
  37. Y. Tang, Y. Lou, P. Han, H. Song, X. Ye, D. Wang, and B. Zhao. Trajectory conditioned cross-embodiment skill transfer, 2025. URL https://arxiv.org/abs/2510.07773
  38. H. Li, L. Sun, Y. Hu, D. Ta, J. Barry, G. Konidaris, and J. Fu. Novaflow: Zero-shot manipulation via actionable flow from generated videos, 2025. URL https://arxiv.org/abs/2510.08568
  39. M. Albaba, C. Li, M. Diomataris, O. Taheri, A. Krause, and M. J. Black. Nil: No-data imitation learning by leveraging pre-trained video diffusion models. arXiv preprint arXiv:2503.10626, 2025
  40. X. B. Peng, E. Coumans, T. Zhang, T.-W. Lee, J. Tan, and S. Levine. Learning agile robotic locomotion skills by imitating animals. In Robotics: Science and Systems, 2020. doi:10.15607/rss.2020.xvi.064
  41. X. B. Peng, Z. Ma, P. Abbeel, S. Levine, and A. Kanazawa. Amp: Adversarial motion priors for stylized physics-based character control. ACM Transactions on Graphics (TOG), 40(4):1–20, 2021. doi:10.1145/3450626.3459670
  42. A. Escontrela et al. Adversarial motion priors make good substitutes for complex reward functions. In 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 25–32, 2022. doi:10.1109/IROS47612.2022.9981973
  43. S. Bohez et al. Imitate and repurpose: Learning reusable robot movement skills from human and animal behaviors. In arXiv preprint arXiv:2203.17138, 2022
  44. T. Yoon, D. Kang, S. Kim, M. Ahn, S. Coros, and S. Choi. Spatio-temporal motion retargeting for quadruped robots. IEEE Transactions on Robotics, 41:5471–5490, 2024. doi:10.1109/TRO.2025.3600123
  45. X. Huang et al. Diffuseloco: Real-time legged locomotion control with diffusion from offline datasets. arXiv preprint arXiv:2404.19264, 2024
  46. Y. Chen, L. Zhao, J. Ma, and P. Lu. In-between motion generation based multi-style quadruped robot locomotion, 2025. URL https://arxiv.org/abs/2507.23053
  47. L. Gao, F. Yang, J. Chen, L. Liu, Y. Zheng, Y. Cai, and Z. Li. Quadfm: Foundational text-driven quadruped motion dataset for generation and control, 2026. URL https://arxiv.org/abs/2603.24021
  48. M. Wang, Z. Wang, H. Xu, K. Hu, Z. Wang, and W. Kang. T2qrm: Text-driven quadruped robot motion generation. In Proceedings of the 6th ACM International Conference on Multimedia in Asia, MMAsia ’24, New York, NY, USA, 2024. Association for Computing Machinery. ISBN 9798400712739. doi:10.1145/3696409.3700230. URL https://doi.org/10.1145/3696409.3700230
  49. Q. Liao, T. E. Truong, X. Huang, Y. Gao, G. Tevet, K. Sreenath, and C. K. Liu. Beyondmimic: From motion tracking to versatile humanoid control via guided diffusion, 2025. URL https://arxiv.org/abs/2508.08241
  50. Gemini Team, R. Anil, S. Borgeaud, J.-B. Alayrac, et al. Gemini: A family of highly capable multimodal models, 2025. URL https://arxiv.org/abs/2312.11805
  51. M. Oquab, T. Darcet, T. Moutakanni, H. Vo, M. Szafraniec, V. Khalidov, P. Fernandez, D. Haziza, F. Massa, A. El-Nouby, M. Assran, N. Ballas, W. Galuba, R. Howes, P.-Y. Huang, S.-W. Li, I. Misra, M. Rabbat, V. Sharma, G. Synnaeve, H. Xu, H. Jegou, J. Mairal, P. Labatut, A. Joulin, and P. Bojanowski. Dinov2: Learning robust visual features without sup...
  52. E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen. Lora: Low-rank adaptation of large language models, 2021. URL https://arxiv.org/abs/2106.09685
  53. Y. Xu, J. Zhang, Q. Zhang, and D. Tao. Vitpose: Simple vision transformer baselines for human pose estimation, 2022. URL https://arxiv.org/abs/2204.12484
  54. M. J. Chong and D. Forsyth. Effectively unbiased fid and inception score and where to find them, 2020. URL https://arxiv.org/abs/1911.07023
  55. T. Unterthiner, S. van Steenkiste, K. Kurach, R. Marinier, M. Michalski, and S. Gelly. Towards accurate generative models of video: A new metric & challenges, 2019. URL https://arxiv.org/abs/1812.01717
  56. G. Comanici, E. Bieber, M. Schaekermann, I. Pasupat, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities,
  57. URL https://arxiv.org/abs/2507.06261
  58. A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever. Learning transferable visual models from natural language supervision, 2021. URL https://arxiv.org/abs/2103.00020
  59. E. Todorov, T. Erez, and Y. Tassa. Mujoco: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 5026–5033, 2012. doi:10.1109/IROS.2012.6386109
  60. K. Zakka, Q. Liao, B. Yi, L. L. Lay, K. Sreenath, and P. Abbeel. mjlab: A lightweight framework for gpu-accelerated robot learning, 2026. URL https://arxiv.org/abs/2601.22074

A Implementation Details

A.1 Video Generation Model Fine-tuning

We fine-tune Wan2.2-I2V-A14B [26] using LoRA [49] with rank 32, targeting the attention projection matrices and fee...