pith. machine review for the scientific record.

arxiv: 2508.05635 · v3 · submitted 2025-08-07 · 💻 cs.RO · cs.CV


Genie Envisioner: A Unified World Foundation Platform for Robotic Manipulation

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 21:25 UTC · model grok-4.3

classification 💻 cs.RO cs.CV
keywords video diffusion model · robotic manipulation · embodied AI · policy learning · neural simulation · world model · instruction following

The pith

A single instruction-conditioned video diffusion model unifies policy learning, simulation, and evaluation for robotic manipulation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Genie Envisioner as a platform that folds policy learning, evaluation, and simulation into one video-generative system. Its central model learns to produce videos from instructions, thereby encoding the spatial, temporal, and semantic structure of real robotic interactions inside a shared latent space. A lightweight decoder then turns those latent features into executable action sequences, while an action-conditioned simulator produces rollouts for testing and refinement. The design targets instruction-driven robots that work across different physical bodies with little extra supervision. The authors also supply a benchmark that scores generated videos on visual quality, physical plausibility, and alignment with commands.
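
The components are described in prose only; the following is a minimal sketch of the closed loop they imply, assuming hypothetical callables encode_latent (GE-Base), decode_actions (GE-Act), and simulate_rollout (GE-Sim) that are illustrative stand-ins, not the paper's API.

```python
# Minimal sketch of the closed loop implied above. The callables and their
# signatures are assumptions for illustration, not the paper's API.
from typing import Callable, List
import numpy as np

def closed_loop_rollout(
    encode_latent: Callable[[np.ndarray, str], np.ndarray],            # GE-Base: frames + instruction -> latent
    decode_actions: Callable[[np.ndarray], np.ndarray],                # GE-Act: latent -> action trajectory
    simulate_rollout: Callable[[np.ndarray, np.ndarray], np.ndarray],  # GE-Sim: latent + actions -> future frames
    frames: np.ndarray,
    instruction: str,
    steps: int = 5,
) -> List[np.ndarray]:
    """Perceive, decode actions, simulate the outcome, and repeat."""
    executed = []
    for _ in range(steps):
        latent = encode_latent(frames, instruction)
        actions = decode_actions(latent)
        frames = simulate_rollout(latent, actions)  # simulated frames stand in for the camera
        executed.append(actions)
    return executed
```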

Core claim

Genie Envisioner shows that one large-scale instruction-conditioned video diffusion model can capture the dynamics of robotic interactions in a structured latent space and directly support both action inference and neural simulation. GE-Act extracts action trajectories from this space through a flow-matching decoder, enabling generalizable policies across embodiments. GE-Sim generates high-fidelity action-conditioned rollouts for closed-loop development. The unified structure removes the need for separate models for each stage of embodied intelligence.
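
The abstract names flow matching as GE-Act's decoding mechanism but gives no equations. Below is a minimal flow-matching action decoder in PyTorch, sketched under the assumption of a fixed-size conditioning vector drawn from the latent space; the dimensions, MLP backbone, and Euler sampler are illustrative choices, not the paper's architecture.

```python
# Illustrative flow-matching action decoder; not the paper's architecture.
import torch
import torch.nn as nn

class FlowMatchingActionDecoder(nn.Module):
    def __init__(self, latent_dim=512, action_dim=7, horizon=16, hidden=256):
        super().__init__()
        self.horizon, self.action_dim = horizon, action_dim
        in_dim = latent_dim + horizon * action_dim + 1          # cond + noisy actions + time
        self.velocity = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, horizon * action_dim),
        )

    def forward(self, x_t, t, cond):
        # Predict the velocity field v(x_t, t | cond).
        inp = torch.cat([x_t.flatten(1), t[:, None], cond], dim=1)
        return self.velocity(inp).view(-1, self.horizon, self.action_dim)

    def loss(self, actions, cond):
        # Linear interpolation path x_t = (1 - t) * noise + t * actions;
        # the regression target is the constant velocity (actions - noise).
        noise = torch.randn_like(actions)
        t = torch.rand(actions.shape[0], device=actions.device)
        x_t = (1 - t)[:, None, None] * noise + t[:, None, None] * actions
        v_target = actions - noise
        return ((self.forward(x_t, t, cond) - v_target) ** 2).mean()

    @torch.no_grad()
    def sample(self, cond, steps=20):
        # Integrate the learned ODE from noise (t=0) to an action trajectory (t=1).
        x = torch.randn(cond.shape[0], self.horizon, self.action_dim, device=cond.device)
        dt = 1.0 / steps
        for i in range(steps):
            t = torch.full((cond.shape[0],), i * dt, device=cond.device)
            x = x + dt * self.forward(x, t, cond)
        return x
```

The core claim only needs a decoder of roughly this shape to read executable trajectories out of the shared latent space.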

What carries the argument

GE-Base, the instruction-conditioned video diffusion model that encodes spatial, temporal, and semantic dynamics of robotic interactions inside a structured latent space.
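
The abstract does not state GE-Base's training objective. As a point of reference, a standard instruction-conditioned denoising objective for video latents, of the kind such diffusion models typically optimize (the noise schedule $\bar\alpha_t$, latent $x_0$, and instruction embedding $c$ are generic notation, not the paper's), reads

$$
\mathcal{L}(\theta) \;=\; \mathbb{E}_{x_0,\, c,\, t,\, \epsilon \sim \mathcal{N}(0, I)}
\left\| \epsilon - \epsilon_\theta\!\left(\sqrt{\bar\alpha_t}\, x_0 + \sqrt{1 - \bar\alpha_t}\, \epsilon,\; t,\; c\right) \right\|_2^2 ,
$$

where $x_0$ is an encoded video clip of a robotic interaction and $c$ is the language instruction; the learned $\epsilon_\theta$ defines the structured latent space the rest of the platform reads from.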

If this is right

  • Policies for new robot embodiments can be obtained with minimal additional supervision by reading actions from the shared latent space.
  • Closed-loop policy improvement becomes possible through repeated high-fidelity neural rollouts without constant physical hardware access.
  • A single model handles visual generation, action planning, and outcome prediction, lowering the engineering overhead for general manipulation systems.
  • Standardized scoring on visual fidelity, physical consistency, and instruction alignment enables direct comparison of future world-model approaches.
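
The last point presumes some way of folding the three axes into comparable numbers. A toy aggregation follows, assuming each axis is already normalized to [0, 1]; the equal weights and the normalization are illustrative, not EWMBench's protocol.

```python
# Toy aggregation of the three scoring axes named above; the equal weights
# and the [0, 1] normalization are assumptions, not EWMBench's protocol.
def composite_world_model_score(visual_fidelity: float,
                                physical_consistency: float,
                                instruction_alignment: float,
                                weights: tuple = (1 / 3, 1 / 3, 1 / 3)) -> float:
    """Weighted combination of per-axis scores, each assumed to lie in [0, 1]."""
    scores = (visual_fidelity, physical_consistency, instruction_alignment)
    if not all(0.0 <= s <= 1.0 for s in scores):
        raise ValueError("per-axis scores are expected in [0, 1]")
    return sum(w * s for w, s in zip(weights, scores))

# Two hypothetical world models scored on the same rollouts become directly comparable:
score_a = composite_world_model_score(0.82, 0.74, 0.91)
score_b = composite_world_model_score(0.88, 0.61, 0.79)
```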

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The latent space learned from video could support transfer to tasks beyond manipulation, such as navigation or tool use, if the dynamics representation proves sufficiently general.
  • Public release of the model weights and benchmark would let other groups test whether the same video foundation improves data efficiency when combined with real robot trajectories.
  • If the diffusion model’s temporal predictions remain accurate over longer horizons, the platform could reduce reliance on expensive real-world data collection for training.

Load-bearing premise

The video diffusion model must accurately represent real-world physical dynamics so that derived actions succeed on physical robots and simulated rollouts remain reliable.

What would settle it

Deploy GE-Act policies on physical robots performing instructed tasks and compare success rates and motion accuracy against baselines trained on real data; separately compare GE-Sim rollouts frame-by-frame with actual camera recordings from the same executions.
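
Neither comparison requires new machinery. Here is a sketch of both checks, assuming aligned uint8 frame arrays and boolean task outcomes; PSNR stands in for whatever per-frame fidelity measure an actual evaluation would adopt, and all names are illustrative.

```python
# Sketch of both checks, assuming aligned frame arrays of shape (T, H, W, 3)
# with uint8 pixels and boolean task outcomes; names are illustrative.
import numpy as np

def per_frame_psnr(sim_frames: np.ndarray, real_frames: np.ndarray) -> np.ndarray:
    """PSNR between each simulated frame and the corresponding camera frame."""
    assert sim_frames.shape == real_frames.shape
    diff = sim_frames.astype(np.float64) - real_frames.astype(np.float64)
    mse = (diff ** 2).mean(axis=(1, 2, 3))                  # one value per frame
    return 10.0 * np.log10((255.0 ** 2) / np.maximum(mse, 1e-12))

def success_rate(task_outcomes: list) -> float:
    """Fraction of instructed tasks completed, for policy-vs-baseline comparison."""
    return sum(bool(o) for o in task_outcomes) / max(len(task_outcomes), 1)
```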

Original abstract

We introduce Genie Envisioner (GE), a unified world foundation platform for robotic manipulation that integrates policy learning, evaluation, and simulation within a single video-generative framework. At its core, GE-Base is a large-scale, instruction-conditioned video diffusion model that captures the spatial, temporal, and semantic dynamics of real-world robotic interactions in a structured latent space. Built upon this foundation, GE-Act maps latent representations to executable action trajectories through a lightweight, flow-matching decoder, enabling precise and generalizable policy inference across diverse embodiments with minimal supervision. To support scalable evaluation and training, GE-Sim serves as an action-conditioned neural simulator, producing high-fidelity rollouts for closed-loop policy development. The platform is further equipped with EWMBench, a standardized benchmark suite measuring visual fidelity, physical consistency, and instruction-action alignment. Together, these components establish Genie Envisioner as a scalable and practical foundation for instruction-driven, general-purpose embodied intelligence. All code, models, and benchmarks will be released publicly.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper introduces Genie Envisioner (GE), a unified world foundation platform for robotic manipulation that integrates policy learning, evaluation, and simulation within a single video-generative framework. GE-Base is presented as a large-scale instruction-conditioned video diffusion model that captures spatial, temporal, and semantic dynamics of robotic interactions in a structured latent space. GE-Act maps these latents to executable action trajectories via a lightweight flow-matching decoder for precise policy inference across embodiments. GE-Sim functions as an action-conditioned neural simulator for high-fidelity closed-loop rollouts. The platform includes EWMBench, a benchmark suite for visual fidelity, physical consistency, and instruction-action alignment. The work claims this establishes a scalable foundation for instruction-driven embodied intelligence, with public release of code, models, and benchmarks.

Significance. If the quantitative results not shown here confirm the claims, the work would offer a significant contribution by unifying video-based world modeling with direct action decoding and simulation in robotics. This could reduce reliance on separate physics engines or task-specific policies and enable more generalizable manipulation across diverse embodiments with minimal supervision. The public release of models and EWMBench would further support reproducibility and community progress in generative world models for embodied AI.

major comments (1)
  1. [Abstract and GE-Base/GE-Act/GE-Sim descriptions] The central claim that GE-Base produces latents sufficiently accurate in spatial, temporal, and physical respects to support GE-Act action recovery and GE-Sim closed-loop rollouts is load-bearing but unsupported. The manuscript describes the architecture and EWMBench metrics (visual fidelity, physical consistency, instruction-action alignment) yet reports no concrete predictive quantities such as per-frame 3D keypoint error, contact-force consistency, or success-rate degradation over multi-step horizons on held-out real-robot trajectories.
minor comments (1)
  1. [Abstract] The abstract packs multiple component descriptions into a single paragraph; splitting the component roles into separate sentences would improve readability without altering content.

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback. The concern about insufficient quantitative validation for the latent representations is well-taken, and we will strengthen the manuscript by adding the requested metrics.

Point-by-point responses
  1. Referee: The central claim that GE-Base produces latents sufficiently accurate in spatial, temporal, and physical respects to support GE-Act action recovery and GE-Sim closed-loop rollouts is load-bearing but unsupported. The manuscript describes the architecture and EWMBench metrics (visual fidelity, physical consistency, instruction-action alignment) yet reports no concrete predictive quantities such as per-frame 3D keypoint error, contact-force consistency, or success-rate degradation over multi-step horizons on held-out real-robot trajectories.

    Authors: We agree that the current manuscript does not report the specific predictive quantities mentioned. While EWMBench evaluates visual fidelity, physical consistency, and instruction-action alignment at the benchmark level, it does not include the per-frame 3D keypoint errors, contact-force consistency measures, or multi-step success-rate degradation curves on held-out real-robot trajectories that would directly substantiate the load-bearing claim about latent accuracy. In the revised version we will add a dedicated quantitative analysis subsection (with accompanying tables and figures) reporting these exact metrics computed on held-out real-robot data for both GE-Act policy rollouts and GE-Sim closed-loop simulations. revision: yes
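
For concreteness, two of the requested quantities can be computed as follows, assuming predicted and ground-truth 3D keypoints are available per frame; the array shapes and dictionary layout are illustrative.

```python
# Two of the requested metrics, assuming predicted and ground-truth 3D
# keypoints per frame; shapes and the dictionary layout are illustrative.
import numpy as np

def keypoint_error_per_frame(pred_kp: np.ndarray, gt_kp: np.ndarray) -> np.ndarray:
    """Mean Euclidean keypoint error for each frame; inputs shaped (T, K, 3), in meters."""
    return np.linalg.norm(pred_kp - gt_kp, axis=-1).mean(axis=-1)

def success_rate_vs_horizon(outcomes_by_horizon: dict) -> dict:
    """Success-rate degradation curve over increasing rollout horizons (steps -> rate)."""
    return {h: sum(map(bool, o)) / max(len(o), 1)
            for h, o in sorted(outcomes_by_horizon.items())}
```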

Circularity Check

0 steps flagged

No significant circularity in derivation chain

Full rationale

The manuscript describes an architectural integration of a video diffusion model (GE-Base) with downstream modules for action decoding (GE-Act) and simulation (GE-Sim), but contains no equations, derivations, or quantitative predictions that reduce claimed performance to fitted parameters or self-referential inputs by construction. Components are presented as distinct extensions of standard generative techniques, with claims supported by external benchmarks (EWMBench) rather than internal tautologies. No self-definitional loops, fitted-input predictions, load-bearing self-citations, or smuggled ansatzes appear in the provided text, leaving the derivation chain self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no explicit free parameters, axioms, or invented entities; the platform is described at the level of standard video diffusion and flow-matching components without detailing training losses, latent dimensions, or physical priors.

pith-pipeline@v0.9.0 · 5518 in / 1182 out tokens · 23860 ms · 2026-05-15T21:25:20.398637+00:00 · methodology

discussion (0)


Forward citations

Cited by 19 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. From Imagined Futures to Executable Actions: Mixture of Latent Actions for Robot Manipulation

    cs.RO 2026-05 unverdicted novelty 7.0

    MoLA infers a mixture of latent actions from generated future videos via modality-aware inverse dynamics models to improve robot manipulation policies.

  2. OA-WAM: Object-Addressable World Action Model for Robust Robot Manipulation

    cs.RO 2026-05 unverdicted novelty 7.0

    OA-WAM uses persistent address vectors and dynamic content vectors in object slots to enable addressable world-action prediction, improving robustness on manipulation benchmarks under scene changes.

  3. Being-H0.7: A Latent World-Action Model from Egocentric Videos

    cs.RO 2026-04 unverdicted novelty 7.0

    Being-H0.7 adds future-aware latent reasoning to direct VLA policies via dual-branch alignment on latent queries, matching world-model benefits at VLA efficiency.

  4. VistaBot: View-Robust Robot Manipulation via Spatiotemporal-Aware View Synthesis

    cs.RO 2026-04 unverdicted novelty 7.0

    VistaBot integrates 4D geometry estimation and spatiotemporal view synthesis into action policies to improve cross-view generalization by 2.6-2.8x on a new VGS metric in simulation and real tasks.

  5. Mask World Model: Predicting What Matters for Robust Robot Policy Learning

    cs.RO 2026-04 unverdicted novelty 7.0

    Mask World Model predicts semantic mask dynamics with video diffusion and integrates it with a diffusion policy head, outperforming RGB world models on LIBERO and RLBench while showing better real-world generalization...

  6. JailWAM: Jailbreaking World Action Models in Robot Control

    cs.RO 2026-04 unverdicted novelty 7.0

    JailWAM is the first dedicated jailbreak framework for World Action Models, achieving 84.2% attack success rate on LingBot-VA in RoboTwin simulation and enabling safety evaluation of robotic AI.

  7. MotuBrain: An Advanced World Action Model for Robot Control

    cs.RO 2026-04 unverdicted novelty 6.0

    MotuBrain jointly models video and action via a three-stream Mixture-of-Transformers UniDiffuser to reach 95.8-96.1% success on RoboTwin 2.0 benchmarks, top EWMScore, and fast 11 Hz inference while adapting to new rob...

  8. Unified 4D World Action Modeling from Video Priors with Asynchronous Denoising

    cs.RO 2026-04 unverdicted novelty 6.0

    X-WAM unifies robotic action execution and 4D world synthesis by adapting video diffusion priors with a lightweight depth branch and asynchronous noise sampling, achieving 79-91% success on robot benchmarks.

  9. Unified 4D World Action Modeling from Video Priors with Asynchronous Denoising

    cs.RO 2026-04 unverdicted novelty 6.0

    X-WAM unifies real-time robotic action execution with high-fidelity 4D world synthesis by adapting video diffusion priors through lightweight depth branches and asynchronous noise sampling, achieving 79-91% success on...

  10. Hi-WM: Human-in-the-World-Model for Scalable Robot Post-Training

    cs.RO 2026-04 unverdicted novelty 6.0

    Hi-WM uses human interventions inside an action-conditioned world model with rollback and branching to generate dense corrective data, raising real-world success by 37.9 points on average across three manipulation tasks.

  11. Grounded World Model for Semantically Generalizable Planning

    cs.RO 2026-04 conditional novelty 6.0

    A vision-language-aligned world model turns visuomotor MPC into a language-following planner that reaches 87% success on 288 unseen semantic tasks where standard VLAs drop to 22%.

  12. WM-DAgger: Enabling Efficient Data Aggregation for Imitation Learning with World Models

    cs.RO 2026-04 unverdicted novelty 6.0

    WM-DAgger uses world models with corrective action synthesis and consistency-guided filtering to aggregate OOD recovery data for imitation learning, reporting 93.3% success in soft bag pushing with five demonstrations.

  13. Fast-WAM: Do World Action Models Need Test-time Future Imagination?

    cs.CV 2026-03 unverdicted novelty 6.0

    Fast-WAM shows that explicit future imagination at test time is not required for strong WAM performance; video modeling during training provides the main benefit.

  14. RoboStereo: Dual-Tower 4D Embodied World Models for Unified Policy Optimization

    cs.CV 2026-03 unverdicted novelty 6.0

    A dual-tower 4D embodied world model called RoboStereo reduces geometric hallucinations and delivers over 97% relative improvement on manipulation tasks via test-time augmentation, imitative learning, and open exploration.

  15. World Action Models are Zero-shot Policies

    cs.RO 2026-02 unverdicted novelty 6.0

    DreamZero uses a 14B video diffusion model as a World Action Model to achieve over 2x better zero-shot generalization on real robots than state-of-the-art VLAs, real-time 7Hz closed-loop control, and cross-embodiment ...

  16. Cosmos Policy: Fine-Tuning Video Models for Visuomotor Control and Planning

    cs.AI 2026-01 conditional novelty 6.0

    Single-stage fine-tuning of a video model to generate actions as latent frames plus future states and values yields state-of-the-art robot policy performance on LIBERO, RoboCasa, and bimanual tasks.

  17. InternVLA-M1: A Spatially Guided Vision-Language-Action Framework for Generalist Robot Policy

    cs.RO 2025-10 unverdicted novelty 6.0

    InternVLA-M1 uses spatially guided pre-training on 2.3M examples followed by action post-training to deliver up to 17% gains on robot manipulation benchmarks and 20.6% on unseen objects.

  18. World-Value-Action Model: Implicit Planning for Vision-Language-Action Systems

    cs.RO 2026-04 unverdicted novelty 5.0

    The World-Value-Action model enables implicit planning for VLA systems by performing inference over a learned latent representation of high-value future trajectories instead of direct action prediction.

  19. World Simulation with Video Foundation Models for Physical AI

    cs.CV 2025-10 unverdicted novelty 4.0

    Cosmos-Predict2.5 unifies text-to-world, image-to-world, and video-to-world generation in one model trained on 200M clips with RL post-training, delivering improved quality and control for physical AI.

Reference graph

Works this paper leans on

30 extracted references · 30 canonical work pages · cited by 18 Pith papers · 20 internal anchors
