minWM: A Full-Stack Open-Source Framework for Real-Time Interactive Video World Models

Bokai Yan; Chongxuan Li; Fan Bao; Guande He; Hongzhou Zhu; Jun Zhu; Kaiwen Zheng; Min Zhao; Wenqiang Sun; Xiao Yang

arxiv: 2605.30263 · v1 · pith:CXKXZ23Rnew · submitted 2026-05-28 · 💻 cs.CV

Min Zhao, Hongzhou Zhu, Bokai Yan, Zihan Zhou, Yimin Chen, Wenqiang Sun, Kaiwen Zheng, Guande He, Xiao Yang, Chongxuan Li, Fan Bao, Jun Zhu

Pith reviewed 2026-06-29 08:20 UTC · model grok-4.3

classification 💻 cs.CV

keywords video world models · autoregressive diffusion · camera control · few-step distillation · causal forcing · real-time video generation · bidirectional to autoregressive · open-source framework

The pith

minWM framework converts bidirectional video diffusion models into camera-controllable autoregressive world models for real-time interaction.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents minWM as an open-source framework that supplies a complete pipeline to transform existing bidirectional text-to-video and text-image-to-video diffusion models into interactive world models. It begins with fine-tuning for camera control, then applies autoregressive diffusion training followed by causal distillation and asymmetric DMD steps to achieve few-step causal generation. A sympathetic reader would care because this directly tackles the practical barriers to turning high-quality but non-causal video models into controllable, low-latency simulators. The work shows the pipeline on two different open backbones and includes ablations on training parameters such as batch size and steps. If the conversion holds, it would let users build and adapt real-time video world models from available checkpoints rather than training everything anew.
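The difference between a bidirectional and a causal video model can be illustrated with attention masks. The sketch below is a generic illustration, not code from minWM: the function names and the frame/token grouping are assumptions, and this page does not describe the exact masking scheme the paper uses. It builds a block-causal mask in which each token may attend to its own frame and to earlier frames only.

```python
from typing import List


def block_causal_mask(num_frames: int, tokens_per_frame: int) -> List[List[bool]]:
    """Return mask[q][k] == True when query token q may attend to key token k.

    Tokens are grouped by frame. A token sees every token in its own frame
    and in all earlier frames, but nothing from later frames.
    (Illustrative helper; the name and layout are assumptions.)
    """
    total = num_frames * tokens_per_frame
    mask = []
    for q in range(total):
        q_frame = q // tokens_per_frame
        mask.append([(k // tokens_per_frame) <= q_frame for k in range(total)])
    return mask


def bidirectional_mask(num_frames: int, tokens_per_frame: int) -> List[List[bool]]:
    """Full attention: every token attends to every token, past or future."""
    total = num_frames * tokens_per_frame
    return [[True] * total for _ in range(total)]


if __name__ == "__main__":
    # Print the causal pattern for 3 frames of 2 tokens each.
    for row in block_causal_mask(num_frames=3, tokens_per_frame=2):
        print("".join("x" if allowed else "." for allowed in row))
```

With a mask of this shape, earlier frames never depend on later ones, which is the property a streaming rollout needs.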

Core claim

minWM provides an end-to-end pipeline that converts existing bidirectional T2V/TI2V video foundation models into camera-controllable few-step autoregressive world models. Specifically, it first fine-tunes a bidirectional video diffusion model with camera control, and then applies the Causal Forcing / Causal Forcing++ pipeline, including AR diffusion training, causal ODE or causal consistency distillation, and asymmetric DMD, to distill it into a few-step autoregressive generator for low-latency rollout. The framework is modular and architecture-extensible across cross-attention-based and MMDiT-style models, and it also supports adapting existing video world models to new data distributions a

What carries the argument

The Causal Forcing / Causal Forcing++ pipeline, which performs autoregressive diffusion training, causal ODE or consistency distillation, and asymmetric DMD to turn bidirectional models into few-step autoregressive generators.
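The stage ordering described above can be written down as a small checklist. This is a hedged sketch: the stage identifiers are hypothetical labels chosen for illustration, not script or config names from the minWM repository, and only the order and the ODE-or-consistency choice come from the text.

```python
from typing import List, Optional


def pipeline_stages(distillation: str = "ode") -> List[str]:
    """Ordered stage labels (hypothetical names) for the described pipeline."""
    if distillation not in ("ode", "consistency"):
        raise ValueError("distillation must be 'ode' or 'consistency'")
    return [
        "camera_finetune_bidirectional",       # add camera control to the base model
        "ar_diffusion_training",               # train the autoregressive diffusion model
        f"causal_{distillation}_distillation",  # causal ODE or causal consistency distillation
        "asymmetric_dmd",                      # final few-step distillation stage
    ]


def next_stage(completed: List[str], distillation: str = "ode") -> Optional[str]:
    """Return the next stage to run, or None when the pipeline is finished.

    Raises if the completed stages are not a prefix of the expected order,
    since each stage starts from the previous stage's checkpoint.
    """
    stages = pipeline_stages(distillation)
    if completed != stages[: len(completed)]:
        raise ValueError("stages must be completed in pipeline order")
    return stages[len(completed)] if len(completed) < len(stages) else None
```

For example, `next_stage(["camera_finetune_bidirectional"])` returns `"ar_diffusion_training"`.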

If this is right

Existing bidirectional video models gain camera controllability through targeted fine-tuning.
Autoregressive rollouts at low latency become available after the distillation steps.
The same pipeline works on both cross-attention and MMDiT architectures.
Existing video world models can be adapted to new data and latency targets.
Practical minimums for batch size and training steps are established through ablations.
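A low-latency autoregressive rollout of the kind listed above can be sketched as a loop that generates one chunk at a time, using only a few denoising steps per chunk and conditioning on past chunks plus the current camera action. Everything here is an assumption for illustration: the toy denoiser, the timestep schedule, and the linear re-noising rule are generic stand-ins, not the sampler used by minWM or Causal Forcing.

```python
import random
from typing import Callable, List, Sequence

Chunk = List[float]
Denoiser = Callable[[Chunk, Sequence[Chunk], float, float], Chunk]


def toy_denoiser(noisy: Chunk, context: Sequence[Chunk], action: float, t: float) -> Chunk:
    """Stand-in for a learned network that predicts the clean chunk.

    The toy 'clean' chunk is the last generated value shifted by the action.
    """
    base = context[-1][-1] if context else 0.0
    return [base + action for _ in noisy]


def rollout(
    denoise: Denoiser,
    actions: Sequence[float],
    chunk_len: int = 2,
    timesteps: Sequence[float] = (1.0, 0.75, 0.5, 0.25),
    seed: int = 0,
) -> List[Chunk]:
    """Generate one chunk per action, each with len(timesteps) denoising steps."""
    rng = random.Random(seed)
    history: List[Chunk] = []
    for action in actions:
        x = [rng.gauss(0.0, 1.0) for _ in range(chunk_len)]  # start from pure noise
        for i, t in enumerate(timesteps):
            x0 = denoise(x, history, action, t)  # predict the clean chunk
            t_next = timesteps[i + 1] if i + 1 < len(timesteps) else 0.0
            # Re-noise the prediction to the next, lower noise level.
            x = [(1.0 - t_next) * c + t_next * rng.gauss(0.0, 1.0) for c in x0]
        history.append(x)  # only past chunks are visible to later steps
    return history
```

Because each chunk depends only on `history`, a new camera action can be supplied after every chunk, which is what makes the rollout interactive.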

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The modular design could support adding other control signals such as object motion or text instructions on top of camera trajectories.
The released scripts and checkpoints could serve as a starting point for testing the pipeline on newer or larger video backbones as they appear.
Community users might combine this conversion method with different distillation techniques to trade off speed and quality in new ways.

Load-bearing premise

The Causal Forcing pipeline can reliably produce low-latency high-quality autoregressive rollouts from bidirectional models without major quality loss or instability.

What would settle it

A side-by-side comparison on the same prompts showing that videos from the distilled few-step autoregressive models exhibit clear instability such as flickering, drifting camera paths, or loss of visual coherence over dozens of frames compared with the original bidirectional model.
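One simple way to put a number on flickering in such a comparison is the mean absolute change between consecutive frames. The sketch below is an assumed, minimal metric for illustration; it is not an evaluation protocol from the paper, and `flicker_score` is a hypothetical name. Frames are represented as nested lists of grayscale values to keep the example dependency-free.

```python
from typing import Sequence

Frame = Sequence[Sequence[float]]


def flicker_score(video: Sequence[Frame]) -> float:
    """Mean absolute per-pixel change between consecutive frames."""
    if len(video) < 2:
        raise ValueError("need at least two frames")
    total = 0.0
    count = 0
    for prev, cur in zip(video, video[1:]):
        for row_prev, row_cur in zip(prev, cur):
            for p, c in zip(row_prev, row_cur):
                total += abs(c - p)
                count += 1
    return total / count
```

Legitimate camera motion also raises this score, so it is only meaningful when the base and distilled models are compared on the same prompt and the same camera trajectory.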

Figures

Figures reproduced from arXiv: 2605.30263 by Bokai Yan, Chongxuan Li, Fan Bao, Guande He, Hongzhou Zhu, Jun Zhu, Kaiwen Zheng, Min Zhao, Wenqiang Sun, Xiao Yang, Yimin Chen, Zihan Zhou.

**Figure 1.** Overview of minWM. minWM is a full-stack pipeline that converts T2V/TI2V foundation models into camera-controllable few-step autoregressive world models, covering data construction, controllable finetuning, AR training, distillation, and low-latency inference.

**Figure 2.** Camera-controllable generation with the distilled few-step AR model. The model supports generation under different camera actions, showing that the distillation algorithm effectively preserves the camera controllability of the base model.

**Figure 3.** Effect of training data on camera-controllable generation. Under our current setup, directly training with SpatialVid did not yet yield reliable camera-controllable generation. We therefore construct datasets with effectively ground-truth camera trajectories, either through 3D reconstruction and re-rendering or WorldPlay-based generation, which enable the model to learn camera controllability.

**Figure 4.** Effect of training steps on camera-controllable generation. Using HY1.5 as an example, we observe that camera controllability emerges progressively during training: the model is largely uncontrollable at one to two thousand steps, starts to acquire controllability around five thousand steps, and reaches strong controllability after eight thousand steps.

**Figure 5.** Effect of batch size on camera-controllable generation. Using Wan2.1 as an example, we find that batch size critically affects camera-control training: batch sizes below 4 often fail to learn controllability, batch size 8 substantially improves controllability but remains unstable, while batch size 16 enables successful training with high controllability.
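The batch-size observation in Figure 5 raises a practical question for small hardware setups: how many gradient-accumulation steps would be needed to reach a target batch of 16. The arithmetic is sketched below. Note the assumption: the captions do not say whether an accumulated batch behaves like a true batch of the same size for camera-control training, so this is bookkeeping only, and `accumulation_steps` is a hypothetical helper.

```python
import math


def accumulation_steps(target_batch: int, per_device_batch: int, num_devices: int) -> int:
    """Smallest number of accumulation steps whose effective batch meets the target.

    Effective batch = per_device_batch * num_devices * accumulation steps.
    """
    if min(target_batch, per_device_batch, num_devices) < 1:
        raise ValueError("all arguments must be positive")
    per_step = per_device_batch * num_devices
    return math.ceil(target_batch / per_step)
```

For instance, four devices with one sample each need four accumulation steps to reach an effective batch of 16.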

Original abstract

Recent video diffusion foundation models have achieved remarkable progress in high-quality video generation, yet turning them into real-time interactive video world models remains challenging. Interactive world models require controllable, causal, and low-latency rollout, which in practice demands a full pipeline spanning data construction, controllable fine-tuning, autoregressive training, few-step distillation, and streaming inference. In this work, we present minWM, a full-stack open-source framework for building real-time interactive video world models. minWM provides an end-to-end pipeline that converts existing bidirectional T2V/TI2V video foundation models into camera-controllable few-step autoregressive world models. Specifically, minWM first fine-tunes a bidirectional video diffusion model with camera control, and then applies the Causal Forcing / Causal Forcing++ pipeline, including AR diffusion training, causal ODE or causal consistency distillation, and asymmetric DMD, to distill it into a few-step autoregressive generator for low-latency rollout. The framework is modular and architecture-extensible: we instantiate it on representative open backbones, including Wan2.1-T2V-1.3B and HY1.5-TI2V-8B, covering both cross-attention-based condition injection and MMDiT-style architectures. minWM also supports adapting existing video world models, such as HY-WorldPlay, to new data distributions, training recipes, and latency targets. Beyond releasing runnable scripts, checkpoints, documentation, and inference code, we provide practical ablations on camera trajectory quality, controllability training steps, and minimal batch-size requirements. We hope minWM serves as a reproducible and extensible recipe for building and adapting real-time interactive video world models. Project Page: [https://github.com/shengshu-ai/minWM](https://github.com/shengshu-ai/minWM)

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

minWM is a practical open-source engineering release that assembles camera fine-tuning and Causal Forcing distillation into a modular pipeline for turning bidirectional video models into low-latency autoregressive ones, with code and checkpoints shipped.

This paper's main point is that minWM wires together camera-controllable fine-tuning on bidirectional T2V/TI2V models with AR diffusion training, causal ODE/consistency distillation, and asymmetric DMD to produce few-step autoregressive generators, then releases the full stack on Wan2.1 and HY1.5 backbones plus adaptation support for models like HY-WorldPlay.

What the work does well is the modularity and the actual release. Runnable scripts, checkpoints, documentation, and inference code lower the barrier for anyone who wants to try building interactive video systems. The reported ablations on camera trajectory quality, controllability training steps, and minimal batch size are the sort of concrete engineering details that help others get started.

The soft spot is the thin evidence on whether the distilled models hold up. The abstract describes the pipeline and claims low-latency controllable rollouts but only shows ablations on training choices, with no reported numbers on long-horizon metrics such as FVD, temporal consistency, or quality drop versus the base model. That is exactly where these distillation steps often introduce drift, so the central claim rests on implementation success that is not yet quantified in the provided text.

This is for researchers in generative video and robotics simulation who need a working recipe rather than a new theoretical result. A reader who plans to implement or extend real-time world models will get direct value from the released assets. It deserves serious referee time because the contribution is the integrated, reproducible pipeline and the open implementation, even if the evaluation section needs more rollout data.

Referee Report

2 major / 1 minor

Summary. The paper presents minWM, an open-source full-stack framework that converts existing bidirectional T2V/TI2V video diffusion models (e.g., Wan2.1-T2V-1.3B, HY1.5-TI2V-8B) into camera-controllable few-step autoregressive world models. The pipeline consists of camera-controlled fine-tuning of the bidirectional model, followed by AR diffusion training, causal ODE/consistency distillation, and asymmetric DMD to enable low-latency rollout; the work also supports adaptation of existing world models and releases code, checkpoints, and ablations on camera trajectories, controllability steps, and batch size.

Significance. If the described pipeline reliably yields stable low-latency rollouts with limited quality degradation, the contribution would be significant for lowering barriers to interactive video world models. The modular architecture support, open-source release of runnable scripts and checkpoints, and practical ablations constitute concrete strengths that aid reproducibility.

major comments (2)

[Abstract and Evaluation section] The central claims of low-latency, high-quality autoregressive rollouts without major quality loss rest on unshown quantitative results; no FVD, temporal consistency, perceptual quality, or long-horizon rollout metrics versus the base bidirectional models are reported, leaving the success of the Causal Forcing / Causal Forcing++ pipeline unverified.
[§3 (Causal Forcing pipeline description)] The assumption that AR diffusion training plus causal ODE/consistency distillation and asymmetric DMD can convert non-causal bidirectional models into stable causal few-step generators is load-bearing for the main claim, yet the manuscript provides no empirical evidence on compounding artifacts, temporal drift, or stability over extended rollouts.
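The temporal-drift concern in the comments above could be quantified by tracking a per-frame quality score along a rollout and fitting a trend line. The sketch below is one assumed way to do that and is not taken from the manuscript: `drift_slope` is a hypothetical helper, and the choice of per-frame score (for example a perceptual quality metric) is left open.

```python
from typing import Sequence


def drift_slope(scores: Sequence[float]) -> float:
    """Least-squares slope of per-frame scores against frame index.

    A slope near zero suggests stable quality over the rollout; a clearly
    negative slope suggests errors compounding as frames accumulate.
    """
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two scores")
    mean_x = (n - 1) / 2
    mean_y = sum(scores) / n
    cov = sum((i - mean_x) * (s - mean_y) for i, s in enumerate(scores))
    var = sum((i - mean_x) ** 2 for i in range(n))
    return cov / var
```

Comparing the slope of the distilled model against the base model on matched prompts would separate distillation-induced drift from degradation the base model already shows.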

minor comments (1)

[Project Page and Code Release] Ensure the GitHub repository link includes complete documentation, all referenced training/inference scripts, and the exact checkpoints used for the reported ablations.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed feedback emphasizing the need for quantitative validation. We agree that the manuscript as submitted does not contain the requested metrics or stability analyses, and we will revise accordingly to address these gaps.

Referee: [Abstract and Evaluation section] The central claims of low-latency, high-quality autoregressive rollouts without major quality loss rest on unshown quantitative results; no FVD, temporal consistency, perceptual quality, or long-horizon rollout metrics versus the base bidirectional models are reported, leaving the success of the Causal Forcing / Causal Forcing++ pipeline unverified.

Authors: We agree that direct quantitative comparisons are necessary to substantiate the claims. The current manuscript prioritizes the description of the modular pipeline, open-source release, and ablations on camera trajectories and training hyperparameters, but omits FVD, temporal consistency, perceptual quality, and long-horizon metrics against the base models. In the revised version we will add these evaluations in the Evaluation section, including comparisons that verify the Causal Forcing pipeline. Revision: yes.

Referee: [§3 (Causal Forcing pipeline description)] The assumption that AR diffusion training plus causal ODE/consistency distillation and asymmetric DMD can convert non-causal bidirectional models into stable causal few-step generators is load-bearing for the main claim, yet the manuscript provides no empirical evidence on compounding artifacts, temporal drift, or stability over extended rollouts.

Authors: We concur that empirical evidence on rollout stability is required. Section 3 currently describes the pipeline components without accompanying experiments on compounding artifacts, temporal drift, or long-horizon behavior. We will incorporate such analyses and any observed limitations into the revised manuscript. Revision: yes.

Circularity Check

0 steps flagged

No circularity: engineering pipeline with no derivations or self-referential predictions

The paper presents an open-source framework and modular pipeline for adapting bidirectional video diffusion models into autoregressive world models via fine-tuning, AR training, distillation steps (Causal Forcing / Causal Forcing++), and inference. No mathematical derivations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the abstract or described structure. The work is self-contained as a reproducible recipe with code release and practical ablations on controllability and batch size; the central claim does not reduce to its own inputs by construction. This matches the default expectation for non-circular engineering papers.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no free parameters, axioms, or invented entities are extractable from the provided text.

Reference graph

Works this paper leans on

36 extracted references · 26 canonical work pages · 14 internal anchors

[1] Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Li Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, Clarence Ng, Ricky Wang, and Aditya Ramesh. Video generation models as world simulators. 2024.

[2] Fan Bao, Chendong Xiang, Gang Yue, Guande He, Hongzhou Zhu, Kaiwen Zheng, Min Zhao, Shilong Liu, Yaole Wang, and Jun Zhu. Vidu: a highly consistent, dynamic and skilled text-to-video generator with diffusion models. arXiv preprint arXiv:2405.04233, 2024.

[3] Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. CogVideoX: Text-to-video diffusion models with an expert transformer. arXiv preprint arXiv:2408.06072, 2024.

[4] Bin Lin, Yunyang Ge, Xinhua Cheng, Zongjian Li, Bin Zhu, Shaodong Wang, Xianyi He, Yang Ye, Shenghai Yuan, Liuhan Chen, et al. Open-Sora Plan: Open-source large video generation model. arXiv preprint arXiv:2412.00131, 2024.

[5] Zangwei Zheng, Xiangyu Peng, Tianji Yang, Chenhui Shen, Shenggui Li, Hongxin Liu, Yukun Zhou, Tianyi Li, and Yang You. Open-Sora: Democratizing efficient video production for all. arXiv preprint arXiv:2412.20404, 2024.

[6] Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314, 2025.

[7] Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. HunyuanVideo: A systematic framework for large video generative models. arXiv preprint arXiv:2412.03603, 2024.

[8] Wenqiang Sun, Haiyu Zhang, Haoyuan Wang, Junta Wu, Zehan Wang, Zhenwei Wang, Yunhong Wang, Jun Zhang, Tengfei Wang, and Chunchao Guo. WorldPlay: Towards long-term geometric consistency for real-time interactive world modeling. arXiv preprint arXiv:2512.14614, 2025.

[9] Philip J. Ball, Jakob Bauer, Frank Belletti, Bethanie Brownfield, Ariel Ephrat, Shlomi Fruchter, Agrim Gupta, Kristian Holsheimer, Aleksander Holynski, Jiri Hron, Christos Kaplanis, Marjorie Limont, Matt McGill, Yanko Oliveira, Jack Parker-Holder, Frank Perbet, Guy Scully, Jeremy Shar, Stephen Spencer, Omer Tov, Ruben Villegas, Emma Wang, Jessi... 2025.

[10] Junshu Tang, Jiacheng Liu, Jiaqi Li, Longhuang Wu, Haoyu Yang, Penghao Zhao, Siruis Gong, Xiang Yuan, Shuai Shao, and Qinglin Lu. Hunyuan-gamecraft-2: Instruction-following interactive game world model. arXiv preprint arXiv:2511.23429, 2025.

[11] Xiaofeng Mao, Zhen Li, Chuanhao Li, Xiaojie Xu, Kaining Ying, Tong He, Jiangmiao Pang, Yu Qiao, and Kaipeng Zhang. Yume-1.5: A text-controlled interactive world generation model. arXiv preprint arXiv:2512.22096, 2025.

[12] Yao Feng, Chendong Xiang, Xinyi Mao, Hengkai Tan, Zuyue Zhang, Shuhe Huang, Kaiwen Zheng, Haitian Liu, Hang Su, and Jun Zhu. Vidarc: Embodied video diffusion model for closed-loop control. arXiv preprint arXiv:2512.17661, 2025.

[13] Yubo Huang, Hailong Guo, Fangtai Wu, Shifeng Zhang, Shijie Huang, Qijun Gan, Lin Liu, Sirui Zhao, Enhong Chen, Jiaming Liu, et al. Live Avatar: Streaming real-time audio-driven avatar generation with infinite length. arXiv preprint arXiv:2512.04677, 2025.

[14] Zhiyao Sun, Ziqiao Peng, Yifeng Ma, Yi Chen, Zhengguang Zhou, Zixiang Zhou, Guozhen Zhang, Youliang Zhang, Yuan Zhou, Qinglin Lu, et al. Streamavatar: Streaming diffusion models for real-time interactive human avatars. arXiv preprint arXiv:2512.22065, 2025.

[15] Yicong Hong, Yiqun Mei, Chongjian Ge, Yiran Xu, Yang Zhou, Sai Bi, Yannick Hold-Geoffroy, Mike Roberts, Matthew Fisher, Eli Shechtman, et al. Relic: Interactive video world model with long-horizon memory. arXiv preprint arXiv:2512.04040, 2025.

[16] Deheng Ye, Fangyun Zhou, Jiacheng Lv, Jianqi Ma, Jun Zhang, Junyan Lv, Junyou Li, Minwen Deng, Mingyu Yang, Qiang Fu, et al. Yan: Foundational interactive video generation. arXiv preprint arXiv:2508.08601, 2025.

[17] Jiannan Xiang, Yi Gu, Zihan Liu, Zeyu Feng, Qiyue Gao, Yiyan Hu, Benhao Huang, Guangyi Liu, Yichi Yang, Kun Zhou, et al. Pan: A world model for general, interactable, and long-horizon world simulation. arXiv preprint arXiv:2511.09057, 2025.

[18] Xianglong He, Chunli Peng, Zexiang Liu, Boyang Wang, Yifan Zhang, Qi Cui, Fei Kang, Biao Jiang, Mengyin An, Yangyang Ren, et al. Matrix-game 2.0: An open-source real-time and streaming interactive world model. arXiv preprint arXiv:2508.13009, 2025.

[19] Joonghyuk Shin, Zhengqi Li, Richard Zhang, Jun-Yan Zhu, Jaesik Park, Eli Shechtman, and Xun Huang. Motionstream: Real-time video generation with interactive motion controls. arXiv preprint arXiv:2511.01266, 2025.

[20] Tianwei Yin, Qiang Zhang, Richard Zhang, William T Freeman, Fredo Durand, Eli Shechtman, and Xun Huang. From slow bidirectional to fast autoregressive video diffusion models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 22963–22974, 2025.

[21] Shanchuan Lin, Xin Xia, Yuxi Ren, Ceyuan Yang, Xuefeng Xiao, and Lu Jiang. Diffusion adversarial post-training for one-step video generation. arXiv preprint arXiv:2501.08316, 2025.

[22] Xun Huang, Zhengqi Li, Guande He, Mingyuan Zhou, and Eli Shechtman. Self Forcing: Bridging the train-test gap in autoregressive video diffusion. arXiv preprint arXiv:2506.08009, 2025.

[23] Hongzhou Zhu, Min Zhao, Guande He, Hang Su, Chongxuan Li, and Jun Zhu. Causal Forcing: Autoregressive diffusion distillation done right for high-quality real-time interactive video generation. arXiv preprint arXiv:2602.02214, 2026.

[24] Min Zhao, Hongzhou Zhu, Kaiwen Zheng, Zihan Zhou, Bokai Yan, Xinyuan Li, Xiao Yang, Chongxuan Li, and Jun Zhu. Causal Forcing++: Scalable few-step autoregressive diffusion distillation for real-time interactive video generation. arXiv preprint arXiv:2605.15141, 2026.

[25] Yongqi Yang, Huayang Huang, Xu Peng, Xiaobin Hu, Donghao Luo, Jiangning Zhang, Chengjie Wang, and Yu Wu. Towards one-step causal video generation via adversarial self-distillation. arXiv preprint arXiv:2511.01419, 2025.

[26] Ruilong Li, Brent Yi, Junchen Liu, Hang Gao, Yi Ma, and Angjoo Kanazawa. Cameras as relative positional encoding. In D. Belgrave, C. Zhang, H. Lin, R. Pascanu, P. Koniusz, M. Ghassemi, and N. Chen, editors, Advances in Neural Information Processing Systems, volume 38, pages 15984–16009. Curran Associates, Inc., 2025.

[27] Hansi Teng, Hongyu Jia, Lei Sun, Lingzhi Li, Maolin Li, Mingqiu Tang, Shuai Han, Tianning Zhang, WQ Zhang, Weifeng Luo, et al. MAGI-1: Autoregressive video generation at scale. arXiv preprint arXiv:2505.13211, 2025.

[28] Zhengyi Wang, Cheng Lu, Yikai Wang, Fan Bao, Chongxuan Li, Hang Su, and Jun Zhu. Prolificdreamer: High-fidelity and diverse text-to-3d generation with variational score distillation. Advances in Neural Information Processing Systems, 36:8406–8441, 2023.

[29] Weijian Luo, Tianyang Hu, Shifeng Zhang, Jiacheng Sun, Zhenguo Li, and Zhihua Zhang. Diff-instruct: A universal approach for transferring knowledge from pre-trained diffusion models. Advances in Neural Information Processing Systems, 36:76525–76546, 2023.

[30] Tianwei Yin, Michaël Gharbi, Richard Zhang, Eli Shechtman, Fredo Durand, William T Freeman, and Taesung Park. One-step diffusion with distribution matching distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6613–6623, 2024.

[31] Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In Forty-first International Conference on Machine Learning, 2024.

[32] Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456, 2020.

[33] Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever. Consistency models. 2023.

[34] Jiahao Wang, Yufeng Yuan, Rujie Zheng, Youtian Lin, Jian Gao, Lin-Zhuo Chen, Yajie Bao, Yi Zhang, Chang Zeng, Yanxi Zhou, et al. Spatialvid: A large-scale video dataset with spatial annotations. arXiv preprint arXiv:2509.09676, 2025.

[35] Lu Ling, Yichen Sheng, Zhi Tu, Wentian Zhao, Cheng Xin, Kun Wan, Lantao Yu, Qianyu Guo, Zixun Yu, Yawen Lu, et al. Dl3dv-10k: A large-scale scene dataset for deep learning-based 3d vision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22160–22169, 2024.

[36] Kepan Nan, Rui Xie, Penghao Zhou, Tiehan Fan, Zhenheng Yang, Zhijie Chen, Xiang Li, Jian Yang, and Ying Tai. OpenVid-1M: A large-scale high-quality dataset for text-to-video generation. arXiv preprint arXiv:2407.02371, 2024.

[1] [1]

Video generation models as world simulators

Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Li Jing, David Schnurr , Joe T aylor , Troy Luhman, Eric Luhman, Clarence Ng, Ricky Wang, and Aditya Ramesh. Video generation models as world simulators. 2024

2024

[2] [2]

Vidu: a highly consistent, dynamic and skilled text-to-video generator with diffusion models

Fan Bao, Chendong Xiang, Gang Yue, Guande He, Hongzhou Zhu, Kaiwen Zheng, Min Zhao, Shilong Liu, Y aole Wang, and Jun Zhu. Vidu: a highly consistent, dynamic and skilled text-to-video generator with diffusion models. arXiv preprint arXiv:2405.04233 , 2024

work page arXiv 2024

[3] [3]

CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

Zhuoyi Y ang, Jiayan T eng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Y ang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. Cogvideox: T ext-to-video diffusion models with an expert transformer . arXiv preprint arXiv:2408.06072, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[4] [4]

Open-Sora Plan: Open-Source Large Video Generation Model

Bin Lin, Yunyang Ge, Xinhua Cheng, Zongjian Li, Bin Zhu, Shaodong Wang, Xianyi He, Y ang Y e, Shenghai Yuan, Liuhan Chen, et al. Open-sora plan: Open-source large video generation model. arXiv preprint arXiv:2412.00131 , 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[5] [5]

Open-Sora: Democratizing Efficient Video Production for All

Zangwei Zheng, Xiangyu Peng, Tianji Y ang, Chenhui Shen, Shenggui Li, Hongxin Liu, Yukun Zhou, Tianyi Li, and Y ang Y ou. Open-sora: Democratizing efﬁcient video production for all. arXiv preprint arXiv:2412.20404 , 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[6] [6]

Wan: Open and Advanced Large-Scale Video Generative Models

T eam Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Y ang, et al. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314 , 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[7] [7]

HunyuanVideo: A Systematic Framework For Large Video Generative Models

Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jian- wei Zhang, et al. Hunyuanvideo: A systematic framework for large video generative models. arXiv preprint arXiv:2412.03603, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[8] [8]

WorldPlay: Towards Long-Term Geometric Consistency for Real-Time Interactive World Modeling

Wenqiang Sun, Haiyu Zhang, Haoyuan Wang, Junta Wu, Zehan Wang, Zhenwei Wang, Yunhong Wang, Jun Zhang, T engfei Wang, and Chunchao Guo. Worldplay: T owards long-term geometric consistency for real-time interactive world modeling. arXiv preprint arXiv:2512.14614 , 2025

[9]

Philip J. Ball, Jakob Bauer, Frank Belletti, Bethanie Brownfield, Ariel Ephrat, Shlomi Fruchter, Agrim Gupta, Kristian Holsheimer, Aleksander Holynski, Jiri Hron, Christos Kaplanis, Marjorie Limont, Matt McGill, Yanko Oliveira, Jack Parker-Holder, Frank Perbet, Guy Scully, Jeremy Shar, Stephen Spencer, Omer Tov, Ruben Villegas, Emma Wang, Jessi...

[10]

Hunyuan-gamecraft-2: Instruction-following interactive game world model

Junshu Tang, Jiacheng Liu, Jiaqi Li, Longhuang Wu, Haoyu Yang, Penghao Zhao, Siruis Gong, Xiang Yuan, Shuai Shao, and Qinglin Lu. Hunyuan-gamecraft-2: Instruction-following interactive game world model. arXiv preprint arXiv:2511.23429, 2025

[11]

Yume-1.5: A text-controlled interactive world generation model

Xiaofeng Mao, Zhen Li, Chuanhao Li, Xiaojie Xu, Kaining Ying, Tong He, Jiangmiao Pang, Yu Qiao, and Kaipeng Zhang. Yume-1.5: A text-controlled interactive world generation model. arXiv preprint arXiv:2512.22096, 2025

[12]

Vidarc: Embodied video diffusion model for closed-loop control

Yao Feng, Chendong Xiang, Xinyi Mao, Hengkai Tan, Zuyue Zhang, Shuhe Huang, Kaiwen Zheng, Haitian Liu, Hang Su, and Jun Zhu. Vidarc: Embodied video diffusion model for closed-loop control. arXiv preprint arXiv:2512.17661, 2025

[13]

Live Avatar: Streaming Real-time Audio-Driven Avatar Generation with Infinite Length

Yubo Huang, Hailong Guo, Fangtai Wu, Shifeng Zhang, Shijie Huang, Qijun Gan, Lin Liu, Sirui Zhao, Enhong Chen, Jiaming Liu, et al. Live avatar: Streaming real-time audio-driven avatar generation with infinite length. arXiv preprint arXiv:2512.04677, 2025

[14]

Streamavatar: Streaming diffusion models for real-time interactive human avatars

Zhiyao Sun, Ziqiao Peng, Yifeng Ma, Yi Chen, Zhengguang Zhou, Zixiang Zhou, Guozhen Zhang, Youliang Zhang, Yuan Zhou, Qinglin Lu, et al. Streamavatar: Streaming diffusion models for real-time interactive human avatars. arXiv preprint arXiv:2512.22065, 2025

[15]

Relic: Interactive video world model with long-horizon memory

Yicong Hong, Yiqun Mei, Chongjian Ge, Yiran Xu, Yang Zhou, Sai Bi, Yannick Hold-Geoffroy, Mike Roberts, Matthew Fisher, Eli Shechtman, et al. Relic: Interactive video world model with long-horizon memory. arXiv preprint arXiv:2512.04040, 2025

[16]

Yan: Foundational interactive video generation

Deheng Ye, Fangyun Zhou, Jiacheng Lv, Jianqi Ma, Jun Zhang, Junyan Lv, Junyou Li, Minwen Deng, Mingyu Yang, Qiang Fu, et al. Yan: Foundational interactive video generation. arXiv preprint arXiv:2508.08601, 2025

[17]

Pan: A world model for general, interactable, and long-horizon world simulation

Jiannan Xiang, Yi Gu, Zihan Liu, Zeyu Feng, Qiyue Gao, Yiyan Hu, Benhao Huang, Guangyi Liu, Yichi Yang, Kun Zhou, et al. Pan: A world model for general, interactable, and long-horizon world simulation. arXiv preprint arXiv:2511.09057, 2025

[18]

Matrix-game 2.0: An open-source real-time and streaming interactive world model

Xianglong He, Chunli Peng, Zexiang Liu, Boyang Wang, Yifan Zhang, Qi Cui, Fei Kang, Biao Jiang, Mengyin An, Yangyang Ren, et al. Matrix-game 2.0: An open-source real-time and streaming interactive world model. arXiv preprint arXiv:2508.13009, 2025

[19]

Motionstream: Real-time video generation with interactive motion controls

Joonghyuk Shin, Zhengqi Li, Richard Zhang, Jun-Yan Zhu, Jaesik Park, Eli Shechtman, and Xun Huang. Motionstream: Real-time video generation with interactive motion controls. arXiv preprint arXiv:2511.01266, 2025

[20]

From slow bidirectional to fast autoregressive video diffusion models

Tianwei Yin, Qiang Zhang, Richard Zhang, William T Freeman, Fredo Durand, Eli Shechtman, and Xun Huang. From slow bidirectional to fast autoregressive video diffusion models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 22963–22974, 2025

[21]

Diffusion adversarial post-training for one-step video generation

Shanchuan Lin, Xin Xia, Yuxi Ren, Ceyuan Yang, Xuefeng Xiao, and Lu Jiang. Diffusion adversarial post-training for one-step video generation. arXiv preprint arXiv:2501.08316, 2025

[22]

Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion

Xun Huang, Zhengqi Li, Guande He, Mingyuan Zhou, and Eli Shechtman. Self forcing: Bridging the train-test gap in autoregressive video diffusion. arXiv preprint arXiv:2506.08009, 2025

[23]

Causal Forcing: Autoregressive Diffusion Distillation Done Right for High-Quality Real-Time Interactive Video Generation

Hongzhou Zhu, Min Zhao, Guande He, Hang Su, Chongxuan Li, and Jun Zhu. Causal forcing: Autoregressive diffusion distillation done right for high-quality real-time interactive video generation. arXiv preprint arXiv:2602.02214, 2026

[24]

Causal Forcing++: Scalable Few-Step Autoregressive Diffusion Distillation for Real-Time Interactive Video Generation

Min Zhao, Hongzhou Zhu, Kaiwen Zheng, Zihan Zhou, Bokai Yan, Xinyuan Li, Xiao Yang, Chongxuan Li, and Jun Zhu. Causal forcing++: Scalable few-step autoregressive diffusion distillation for real-time interactive video generation. arXiv preprint arXiv:2605.15141, 2026

[25]

Towards one-step causal video generation via adversarial self-distillation

Yongqi Yang, Huayang Huang, Xu Peng, Xiaobin Hu, Donghao Luo, Jiangning Zhang, Chengjie Wang, and Yu Wu. Towards one-step causal video generation via adversarial self-distillation. arXiv preprint arXiv:2511.01419, 2025

[26]

Cameras as relative positional encoding

Ruilong Li, Brent Yi, Junchen Liu, Hang Gao, Yi Ma, and Angjoo Kanazawa. Cameras as relative positional encoding. In D. Belgrave, C. Zhang, H. Lin, R. Pascanu, P. Koniusz, M. Ghassemi, and N. Chen, editors, Advances in Neural Information Processing Systems, volume 38, pages 15984–16009. Curran Associates, Inc., 2025

[27]

MAGI-1: Autoregressive Video Generation at Scale

Hansi Teng, Hongyu Jia, Lei Sun, Lingzhi Li, Maolin Li, Mingqiu Tang, Shuai Han, Tianning Zhang, WQ Zhang, Weifeng Luo, et al. Magi-1: Autoregressive video generation at scale. arXiv preprint arXiv:2505.13211, 2025

[28]

Prolificdreamer: High-fidelity and diverse text-to-3d generation with variational score distillation

Zhengyi Wang, Cheng Lu, Yikai Wang, Fan Bao, Chongxuan Li, Hang Su, and Jun Zhu. Prolificdreamer: High-fidelity and diverse text-to-3d generation with variational score distillation. Advances in neural information processing systems, 36:8406–8441, 2023

[29]

Diff-instruct: A universal approach for transferring knowledge from pre-trained diffusion models

Weijian Luo, Tianyang Hu, Shifeng Zhang, Jiacheng Sun, Zhenguo Li, and Zhihua Zhang. Diff-instruct: A universal approach for transferring knowledge from pre-trained diffusion models. Advances in Neural Information Processing Systems, 36:76525–76546, 2023

[30]

One-step diffusion with distribution matching distillation

Tianwei Yin, Michaël Gharbi, Richard Zhang, Eli Shechtman, Fredo Durand, William T Freeman, and Taesung Park. One-step diffusion with distribution matching distillation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6613–6623, 2024

[31]

Scaling rectified flow transformers for high-resolution image synthesis

Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In Forty-first international conference on machine learning, 2024

[32]

Score-Based Generative Modeling through Stochastic Differential Equations

Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456, 2020

[33]

Consistency models

Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever. Consistency models. 2023

[34]

Spatialvid: A large-scale video dataset with spatial annotations

Jiahao Wang, Yufeng Yuan, Rujie Zheng, Youtian Lin, Jian Gao, Lin-Zhuo Chen, Yajie Bao, Yi Zhang, Chang Zeng, Yanxi Zhou, et al. Spatialvid: A large-scale video dataset with spatial annotations. arXiv preprint arXiv:2509.09676, 2025

[35]

Dl3dv-10k: A large-scale scene dataset for deep learning-based 3d vision

Lu Ling, Yichen Sheng, Zhi Tu, Wentian Zhao, Cheng Xin, Kun Wan, Lantao Yu, Qianyu Guo, Zixun Yu, Yawen Lu, et al. Dl3dv-10k: A large-scale scene dataset for deep learning-based 3d vision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22160–22169, 2024

[36]

OpenVid-1M: A Large-Scale High-Quality Dataset for Text-to-video Generation

Kepan Nan, Rui Xie, Penghao Zhou, Tiehan Fan, Zhenheng Yang, Zhijie Chen, Xiang Li, Jian Yang, and Ying Tai. Openvid-1m: A large-scale high-quality dataset for text-to-video generation. arXiv preprint arXiv:2407.02371, 2024
