pith. sign in

arxiv: 2605.30263 · v1 · pith:CXKXZ23Rnew · submitted 2026-05-28 · 💻 cs.CV

minWM: A Full-Stack Open-Source Framework for Real-Time Interactive Video World Models

Pith reviewed 2026-06-29 08:20 UTC · model grok-4.3

classification 💻 cs.CV
keywords video world modelsautoregressive diffusioncamera controlfew-step distillationcausal forcingreal-time video generationbidirectional to autoregressiveopen-source framework
0
0 comments X

The pith

minWM framework converts bidirectional video diffusion models into camera-controllable autoregressive world models for real-time interaction.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents minWM as an open-source framework that supplies a complete pipeline to transform existing bidirectional text-to-video and text-image-to-video diffusion models into interactive world models. It begins with fine-tuning for camera control, then applies autoregressive diffusion training followed by causal distillation and asymmetric DMD steps to achieve few-step causal generation. A sympathetic reader would care because this directly tackles the practical barriers to turning high-quality but non-causal video models into controllable, low-latency simulators. The work shows the pipeline on two different open backbones and includes ablations on training parameters such as batch size and steps. If the conversion holds, it would let users build and adapt real-time video world models from available checkpoints rather than training everything anew.

Core claim

minWM provides an end-to-end pipeline that converts existing bidirectional T2V/TI2V video foundation models into camera-controllable few-step autoregressive world models. Specifically, it first fine-tunes a bidirectional video diffusion model with camera control, and then applies the Causal Forcing / Causal Forcing++ pipeline, including AR diffusion training, causal ODE or causal consistency distillation, and asymmetric DMD, to distill it into a few-step autoregressive generator for low-latency rollout. The framework is modular and architecture-extensible across cross-attention-based and MMDiT-style models, and it also supports adapting existing video world models to new data distributions a

What carries the argument

The Causal Forcing / Causal Forcing++ pipeline, which performs autoregressive diffusion training, causal ODE or consistency distillation, and asymmetric DMD to turn bidirectional models into few-step autoregressive generators.

If this is right

  • Existing bidirectional video models gain camera controllability through targeted fine-tuning.
  • Autoregressive rollouts at low latency become available after the distillation steps.
  • The same pipeline works on both cross-attention and MMDiT architectures.
  • Existing video world models can be adapted to new data and latency targets.
  • Practical minimums for batch size and training steps are established through ablations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The modular design could support adding other control signals such as object motion or text instructions on top of camera trajectories.
  • The released scripts and checkpoints could serve as a starting point for testing the pipeline on newer or larger video backbones as they appear.
  • Community users might combine this conversion method with different distillation techniques to trade off speed and quality in new ways.

Load-bearing premise

The Causal Forcing pipeline can reliably produce low-latency high-quality autoregressive rollouts from bidirectional models without major quality loss or instability.

What would settle it

A side-by-side comparison on the same prompts showing that videos from the distilled few-step autoregressive models exhibit clear instability such as flickering, drifting camera paths, or loss of visual coherence over dozens of frames compared with the original bidirectional model.

Figures

Figures reproduced from arXiv: 2605.30263 by Bokai Yan, Chongxuan Li, Fan Bao, Guande He, Hongzhou Zhu, Jun Zhu, Kaiwen Zheng, Min Zhao, Wenqiang Sun, Xiao Yang, Yimin Chen, Zihan Zhou.

Figure 1
Figure 1. Figure 1: Overview of minWM. minWM is a full-stack pipeline that converts T2V/TI2V foundation models into camera-controllable few-step autoregressive world models, covering data construction, controllable fine￾tuning, AR training, distillation, and low-latency inference. To this end, we present minWM, a full-stack open-source framework for building real-time interactive video world models. Instead of releasing a sin… view at source ↗
Figure 2
Figure 2. Figure 2: Camera-controllable generation with the distilled few-step AR model. The model supports generation under different camera actions, showing that the distillation algorithm effectively preserves the camera controllability of the base model. 7 [PITH_FULL_IMAGE:figures/full_fig_p007_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Effect of training data on camera-controllable generation. Under our current setup, directly training with SpatialVid did not yet yield reliable camera-controllable generation. We therefore construct datasets with effectively ground-truth camera trajectories, either through 3D reconstruction and re-rendering or WorldPlay-based generation, which enable the model to learn camera controllability. 8 [PITH_FUL… view at source ↗
Figure 4
Figure 4. Figure 4: Effect of training steps on camera-controllable generation. Using HY1.5 as an example, we observe that camera controllability emerges progressively during training: the model is largely uncontrollable at one to two thousand steps, starts to acquire controllability around five thousand steps, and reaches strong controllability after eight thousand steps. 9 [PITH_FULL_IMAGE:figures/full_fig_p009_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Effect of batch size on camera-controllable generation. Using Wan2.1 as an example, we find that batch size critically affects camera-control training: batch sizes below 4 often fail to learn controllability, batch size 8 substantially improves controllability but remains unstable, while batch size 16 enables success￾ful training with high controllability. 10 [PITH_FULL_IMAGE:figures/full_fig_p010_5.png] view at source ↗
read the original abstract

Recent video diffusion foundation models have achieved remarkable progress in high-quality video generation, yet turning them into real-time interactive video world models remains challenging. Interactive world models require controllable, causal, and low-latency rollout, which in practice demands a full pipeline spanning data construction, controllable fine-tuning, autoregressive training, few-step distillation, and streaming inference. In this work, we present minWM, a full-stack open-source framework for building real-time interactive video world models. minWM provides an end-to-end pipeline that converts existing bidirectional T2V/TI2V video foundation models into camera-controllable few-step autoregressive world models. Specifically, minWM first fine-tunes a bidirectional video diffusion model with camera control, and then applies the Causal Forcing / Causal Forcing++ pipeline, including AR diffusion training, causal ODE or causal consistency distillation, and asymmetric DMD, to distill it into a few-step autoregressive generator for low-latency rollout. The framework is modular and architecture-extensible: we instantiate it on representative open backbones, including Wan2.1-T2V-1.3B and HY1.5-TI2V-8B, covering both cross-attention-based condition injection and MMDiT-style architectures. minWM also supports adapting existing video world models, such as HY-WorldPlay, to new data distributions, training recipes, and latency targets. Beyond releasing runnable scripts, checkpoints, documentation, and inference code, we provide practical ablations on camera trajectory quality, controllability training steps, and minimal batch-size requirements. We hope minWM serves as a reproducible and extensible recipe for building and adapting real-time interactive video world models. Project Page: [https://github.com/shengshu-ai/minWM](https://github.com/shengshu-ai/minWM)

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper presents minWM, an open-source full-stack framework that converts existing bidirectional T2V/TI2V video diffusion models (e.g., Wan2.1-T2V-1.3B, HY1.5-TI2V-8B) into camera-controllable few-step autoregressive world models. The pipeline consists of camera-controlled fine-tuning of the bidirectional model, followed by AR diffusion training, causal ODE/consistency distillation, and asymmetric DMD to enable low-latency rollout; the work also supports adaptation of existing world models and releases code, checkpoints, and ablations on camera trajectories, controllability steps, and batch size.

Significance. If the described pipeline reliably yields stable low-latency rollouts with limited quality degradation, the contribution would be significant for lowering barriers to interactive video world models. The modular architecture support, open-source release of runnable scripts and checkpoints, and practical ablations constitute concrete strengths that aid reproducibility.

major comments (2)
  1. [Abstract and Evaluation section] Abstract and Evaluation section: The central claims of low-latency, high-quality autoregressive rollouts without major quality loss rest on unshown quantitative results; no FVD, temporal consistency, perceptual quality, or long-horizon rollout metrics versus the base bidirectional models are reported, leaving the success of the Causal Forcing / Causal Forcing++ pipeline unverified.
  2. [§3 (Causal Forcing pipeline description)] §3 (Causal Forcing pipeline description): The assumption that AR diffusion training plus causal ODE/consistency distillation and asymmetric DMD can convert non-causal bidirectional models into stable causal few-step generators is load-bearing for the main claim, yet the manuscript provides no empirical evidence on compounding artifacts, temporal drift, or stability over extended rollouts.
minor comments (1)
  1. [Project Page and Code Release] Ensure the GitHub repository link includes complete documentation, all referenced training/inference scripts, and the exact checkpoints used for the reported ablations.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed feedback emphasizing the need for quantitative validation. We agree that the manuscript as submitted does not contain the requested metrics or stability analyses, and we will revise accordingly to address these gaps.

read point-by-point responses
  1. Referee: [Abstract and Evaluation section] Abstract and Evaluation section: The central claims of low-latency, high-quality autoregressive rollouts without major quality loss rest on unshown quantitative results; no FVD, temporal consistency, perceptual quality, or long-horizon rollout metrics versus the base bidirectional models are reported, leaving the success of the Causal Forcing / Causal Forcing++ pipeline unverified.

    Authors: We agree that direct quantitative comparisons are necessary to substantiate the claims. The current manuscript prioritizes the description of the modular pipeline, open-source release, and ablations on camera trajectories and training hyperparameters, but omits FVD, temporal consistency, perceptual quality, and long-horizon metrics against the base models. In the revised version we will add these evaluations in the Evaluation section, including comparisons that verify the Causal Forcing pipeline. revision: yes

  2. Referee: [§3 (Causal Forcing pipeline description)] §3 (Causal Forcing pipeline description): The assumption that AR diffusion training plus causal ODE/consistency distillation and asymmetric DMD can convert non-causal bidirectional models into stable causal few-step generators is load-bearing for the main claim, yet the manuscript provides no empirical evidence on compounding artifacts, temporal drift, or stability over extended rollouts.

    Authors: We concur that empirical evidence on rollout stability is required. Section 3 currently describes the pipeline components without accompanying experiments on compounding artifacts, temporal drift, or long-horizon behavior. We will incorporate such analyses and any observed limitations into the revised manuscript. revision: yes

Circularity Check

0 steps flagged

No circularity: engineering pipeline with no derivations or self-referential predictions

full rationale

The paper presents an open-source framework and modular pipeline for adapting bidirectional video diffusion models into autoregressive world models via fine-tuning, AR training, distillation steps (Causal Forcing / Causal Forcing++), and inference. No mathematical derivations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the abstract or described structure. The work is self-contained as a reproducible recipe with code release and practical ablations on controllability and batch size; the central claim does not reduce to its own inputs by construction. This matches the default expectation for non-circular engineering papers.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no free parameters, axioms, or invented entities are extractable from the provided text.

pith-pipeline@v0.9.1-grok · 5901 in / 1097 out tokens · 19562 ms · 2026-06-29T08:20:33.563844+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

36 extracted references · 26 canonical work pages · 14 internal anchors

  1. [1]

    Video generation models as world simulators

    Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Li Jing, David Schnurr , Joe T aylor , Troy Luhman, Eric Luhman, Clarence Ng, Ricky Wang, and Aditya Ramesh. Video generation models as world simulators. 2024

  2. [2]

    Vidu: a highly consistent, dynamic and skilled text-to-video generator with diffusion models

    Fan Bao, Chendong Xiang, Gang Yue, Guande He, Hongzhou Zhu, Kaiwen Zheng, Min Zhao, Shilong Liu, Y aole Wang, and Jun Zhu. Vidu: a highly consistent, dynamic and skilled text-to-video generator with diffusion models. arXiv preprint arXiv:2405.04233 , 2024

  3. [3]

    CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

    Zhuoyi Y ang, Jiayan T eng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Y ang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. Cogvideox: T ext-to-video diffusion models with an expert transformer . arXiv preprint arXiv:2408.06072, 2024

  4. [4]

    Open-Sora Plan: Open-Source Large Video Generation Model

    Bin Lin, Yunyang Ge, Xinhua Cheng, Zongjian Li, Bin Zhu, Shaodong Wang, Xianyi He, Y ang Y e, Shenghai Yuan, Liuhan Chen, et al. Open-sora plan: Open-source large video generation model. arXiv preprint arXiv:2412.00131 , 2024

  5. [5]

    Open-Sora: Democratizing Efficient Video Production for All

    Zangwei Zheng, Xiangyu Peng, Tianji Y ang, Chenhui Shen, Shenggui Li, Hongxin Liu, Yukun Zhou, Tianyi Li, and Y ang Y ou. Open-sora: Democratizing efficient video production for all. arXiv preprint arXiv:2412.20404 , 2024

  6. [6]

    Wan: Open and Advanced Large-Scale Video Generative Models

    T eam Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Y ang, et al. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314 , 2025

  7. [7]

    HunyuanVideo: A Systematic Framework For Large Video Generative Models

    Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jian- wei Zhang, et al. Hunyuanvideo: A systematic framework for large video generative models. arXiv preprint arXiv:2412.03603, 2024

  8. [8]

    WorldPlay: Towards Long-Term Geometric Consistency for Real-Time Interactive World Modeling

    Wenqiang Sun, Haiyu Zhang, Haoyuan Wang, Junta Wu, Zehan Wang, Zhenwei Wang, Yunhong Wang, Jun Zhang, T engfei Wang, and Chunchao Guo. Worldplay: T owards long-term geometric consistency for real-time interactive world modeling. arXiv preprint arXiv:2512.14614 , 2025

  9. [9]

    Philip J. Ball, Jakob Bauer , Frank Belletti, Bethanie Brownfield, Ariel Ephrat, Shlomi Fruchter , Agrim Gupta, Kris- tian Holsheimer , Aleksander Holynski, Jiri Hron, Christos Kaplanis, Marjorie Limont, Matt McGill, Y anko Oliveira, Jack Parker-Holder , Frank Perbet, Guy Scully , Jeremy Shar , Stephen Spencer , Omer T ov , Ruben Villegas, Emma Wang, Jessi...

  10. [10]

    Hunyuan-gamecraft-2: Instruction-following interactive game world model

    Junshu T ang, Jiacheng Liu, Jiaqi Li, Longhuang Wu, Haoyu Y ang, Penghao Zhao, Siruis Gong, Xiang Yuan, Shuai Shao, and Qinglin Lu. Hunyuan-gamecraft-2: Instruction-following interactive game world model. arXiv preprint arXiv:2511.23429, 2025

  11. [11]

    Yume-1.5: A text-controlled interactive world generation model

    Xiaofeng Mao, Zhen Li, Chuanhao Li, Xiaojie Xu, Kaining Ying, T ong He, Jiangmiao Pang, Yu Qiao, and Kaipeng Zhang. Yume-1.5: A text-controlled interactive world generation model. arXiv preprint arXiv:2512.22096 , 2025

  12. [12]

    Vidarc: Embodied video diffusion model for closed-loop control

    Y ao Feng, Chendong Xiang, Xinyi Mao, Hengkai T an, Zuyue Zhang, Shuhe Huang, Kaiwen Zheng, Haitian Liu, Hang Su, and Jun Zhu. Vidarc: Embodied video diffusion model for closed-loop control. arXiv preprint arXiv:2512.17661, 2025

  13. [13]

    Live Avatar: Streaming Real-time Audio-Driven Avatar Generation with Infinite Length

    Yubo Huang, Hailong Guo, Fangtai Wu, Shifeng Zhang, Shijie Huang, Qijun Gan, Lin Liu, Sirui Zhao, Enhong Chen, Jiaming Liu, et al. Live avatar: Streaming real-time audio-driven avatar generation with infinite length. arXiv preprint arXiv:2512.04677, 2025

  14. [14]

    Streamavatar: Streaming diffusion models for real-time interactive human avatars

    Zhiyao Sun, Ziqiao Peng, Yifeng Ma, Yi Chen, Zhengguang Zhou, Zixiang Zhou, Guozhen Zhang, Y ouliang Zhang, Yuan Zhou, Qinglin Lu, et al. Streamavatar: Streaming diffusion models for real-time interactive human avatars. arXiv preprint arXiv:2512.22065 , 2025. 11 minWM: A Full-Stack Open-Source Framework for Real-Time Interactive Video World Models

  15. [15]

    Relic: Interactive video world model with long-horizon memory

    Yicong Hong, Yiqun Mei, Chongjian Ge, Yiran Xu, Y ang Zhou, Sai Bi, Y annick Hold-Geoffroy , Mike Roberts, Matthew Fisher , Eli Shechtman, et al. Relic: Interactive video world model with long-horizon memory . arXiv preprint arXiv:2512.04040, 2025

  16. [16]

    Y an: Foundational interactive video generation

    Deheng Y e, Fangyun Zhou, Jiacheng Lv , Jianqi Ma, Jun Zhang, Junyan Lv , Junyou Li, Minwen Deng, Mingyu Y ang, Qiang Fu, et al. Y an: Foundational interactive video generation. arXiv preprint arXiv:2508.08601 , 2025

  17. [17]

    Pan: A world model for general, interactable, and long-horizon world simulation

    Jiannan Xiang, Yi Gu, Zihan Liu, Zeyu Feng, Qiyue Gao, Yiyan Hu, Benhao Huang, Guangyi Liu, Yichi Y ang, Kun Zhou, et al. Pan: A world model for general, interactable, and long-horizon world simulation. arXiv preprint arXiv:2511.09057, 2025

  18. [18]

    Matrix-game 2.0: An open-source real-time and streaming interactive world model

    Xianglong He, Chunli Peng, Zexiang Liu, Boyang Wang, Yifan Zhang, Qi Cui, Fei Kang, Biao Jiang, Mengyin An, Y angyang Ren, et al. Matrix-game 2.0: An open-source real-time and streaming interactive world model. arXiv preprint arXiv:2508.13009, 2025

  19. [19]

    Motion- stream: Real-time video generation with interactive motion controls

    Joonghyuk Shin, Zhengqi Li, Richard Zhang, Jun-Y an Zhu, Jaesik Park, Eli Shechtman, and Xun Huang. Motion- stream: Real-time video generation with interactive motion controls. arXiv preprint arXiv:2511.01266 , 2025

  20. [20]

    From slow bidirectional to fast autoregressive video diffusion models

    Tianwei Yin, Qiang Zhang, Richard Zhang, William T Freeman, Fredo Durand, Eli Shechtman, and Xun Huang. From slow bidirectional to fast autoregressive video diffusion models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 22963–22974, 2025

  21. [21]

    Diffusion adversarial post-training for one-step video generation

    Shanchuan Lin, Xin Xia, Yuxi Ren, Ceyuan Y ang, Xuefeng Xiao, and Lu Jiang. Diffusion adversarial post-training for one-step video generation. arXiv preprint arXiv:2501.08316 , 2025

  22. [22]

    Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion

    Xun Huang, Zhengqi Li, Guande He, Mingyuan Zhou, and Eli Shechtman. Self forcing: Bridging the train-test gap in autoregressive video diffusion. arXiv preprint arXiv:2506.08009 , 2025

  23. [23]

    Causal Forcing: Autoregressive Diffusion Distillation Done Right for High-Quality Real-Time Interactive Video Generation

    Hongzhou Zhu, Min Zhao, Guande He, Hang Su, Chongxuan Li, and Jun Zhu. Causal forcing: Autoregressive diffu- sion distillation done right for high-quality real-time interactive video generation. arXiv preprint arXiv:2602.02214, 2026

  24. [24]

    Causal Forcing++: Scalable Few-Step Autoregressive Diffusion Distillation for Real-Time Interactive Video Generation

    Min Zhao, Hongzhou Zhu, Kaiwen Zheng, Zihan Zhou, Bokai Y an, Xinyuan Li, Xiao Y ang, Chongxuan Li, and Jun Zhu. Causal forcing++: Scalable few-step autoregressive diffusion distillation for real-time interactive video generation. arXiv preprint arXiv:2605.15141 , 2026

  25. [25]

    T owards one-step causal video generation via adversarial self-distillation

    Y ongqi Y ang, Huayang Huang, Xu Peng, Xiaobin Hu, Donghao Luo, Jiangning Zhang, Chengjie Wang, and Yu Wu. T owards one-step causal video generation via adversarial self-distillation. arXiv preprint arXiv:2511.01419 , 2025

  26. [26]

    Cameras as relative positional encoding

    Ruilong Li, Brent Yi, Junchen Liu, Hang Gao, Yi Ma, and Angjoo Kanazawa. Cameras as relative positional encoding. In D. Belgrave, C. Zhang, H. Lin, R. Pascanu, P. Koniusz, M. Ghassemi, and N. Chen, editors, Advances in Neural Information Processing Systems , volume 38, pages 15984–16009. Curran Associates, Inc., 2025

  27. [27]

    MAGI-1: Autoregressive Video Generation at Scale

    Hansi T eng, Hongyu Jia, Lei Sun, Lingzhi Li, Maolin Li, Mingqiu T ang, Shuai Han, Tianning Zhang, WQ Zhang, Weifeng Luo, et al. Magi-1: Autoregressive video generation at scale. arXiv preprint arXiv:2505.13211 , 2025

  28. [28]

    Prolificdreamer: High- fidelity and diverse text-to-3d generation with variational score distillation

    Zhengyi Wang, Cheng Lu, Yikai Wang, Fan Bao, Chongxuan Li, Hang Su, and Jun Zhu. Prolificdreamer: High- fidelity and diverse text-to-3d generation with variational score distillation. Advances in neural information process- ing systems, 36:8406–8441, 2023

  29. [29]

    Diff-instruct: A universal approach for transferring knowledge from pre-trained diffusion models

    Weijian Luo, Tianyang Hu, Shifeng Zhang, Jiacheng Sun, Zhenguo Li, and Zhihua Zhang. Diff-instruct: A universal approach for transferring knowledge from pre-trained diffusion models. Advances in Neural Information Processing Systems, 36:76525–76546, 2023

  30. [30]

    One-step diffusion with distribution matching distillation

    Tianwei Yin, Michaël Gharbi, Richard Zhang, Eli Shechtman, Fredo Durand, William T Freeman, and T aesung Park. One-step diffusion with distribution matching distillation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages 6613–6623, 2024

  31. [31]

    Scaling rectified flow transformers for high-resolution image synthesis

    Patrick Esser , Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller , Harry Saini, Y am Levi, Dominik Lorenz, Axel Sauer , Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In Forty-first international conference on machine learning , 2024

  32. [32]

    Score-Based Generative Modeling through Stochastic Differential Equations

    Y ang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar , Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456 , 2020. 12 minWM: A Full-Stack Open-Source Framework for Real-Time Interactive Video World Models

  33. [33]

    Consistency models

    Y ang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever . Consistency models. 2023

  34. [34]

    Spatialvid: A large-scale video dataset with spatial annotations

    Jiahao Wang, Yufeng Yuan, Rujie Zheng, Y outian Lin, Jian Gao, Lin-Zhuo Chen, Y ajie Bao, Yi Zhang, Chang Zeng, Y anxi Zhou, et al. Spatialvid: A large-scale video dataset with spatial annotations. arXiv preprint arXiv:2509.09676, 2025

  35. [35]

    Dl3dv-10k: A large-scale scene dataset for deep learning-based 3d vision

    Lu Ling, Yichen Sheng, Zhi T u, Wentian Zhao, Cheng Xin, Kun Wan, Lantao Yu, Qianyu Guo, Zixun Yu, Y awen Lu, et al. Dl3dv-10k: A large-scale scene dataset for deep learning-based 3d vision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages 22160–22169, 2024

  36. [36]

    OpenVid-1M: A Large-Scale High-Quality Dataset for Text-to-video Generation

    Kepan Nan, Rui Xie, Penghao Zhou, Tiehan Fan, Zhenheng Y ang, Zhijie Chen, Xiang Li, Jian Y ang, and Ying T ai. Openvid-1m: A large-scale high-quality dataset for text-to-video generation. arXiv preprint arXiv:2407.02371 , 2024. 13