pith. machine review for the scientific record.

arxiv: 2603.28489 · v2 · submitted 2026-03-30 · 📡 eess.IV · cs.CV

Recognition: 2 Lean theorem links

Video Generation Models as World Models: Efficient Paradigms, Architectures and Algorithms

Hanzhong Guo, Junxiong Lin, Muyang He, Yizhou Yu

Pith reviewed 2026-05-14 01:31 UTC · model grok-4.3

classification 📡 eess.IV cs.CV
keywords: video generation · world models · efficient architectures · inference algorithms · physical simulation · autonomous driving

The pith

Efficiency techniques can turn video generation models into usable world simulators.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper reviews methods to reduce the computational cost of video generation so these models can function as practical simulators of physical dynamics and long-term cause-and-effect. It organizes the literature into a taxonomy with three parts: efficient modeling paradigms, efficient network architectures, and efficient inference algorithms. The authors state that closing the efficiency gap would directly support interactive uses such as autonomous driving, embodied AI, and game environments. They end by listing open directions for building real-time, robust video-based world models.

Core claim

Video generation models already hold the theoretical ability to simulate complex physical dynamics and long-horizon causalities, yet the high computational cost of spatiotemporal modeling keeps them from serving as usable world models. The paper supplies a three-part taxonomy of efficiency improvements and argues that closing this gap would enable practical applications in autonomous driving, embodied AI, and game simulation.

What carries the argument

Three-dimensional taxonomy covering efficient modeling paradigms, efficient network architectures, and efficient inference algorithms.
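To make the three axes more concrete, here is a rough sketch of how techniques named elsewhere on this page might slot into them. The groupings below are an illustrative reading drawn from the figures and reference list, not the paper's own assignment.

```python
# Illustrative organization of the survey's three efficiency axes.
# Assumed groupings for illustration only; the paper's own tables may differ.
taxonomy = {
    "efficient modeling paradigms": [
        "autoregressive / chunk-wise generation (e.g., diffusion-forcing-style training)",
        "compressed latent-space modeling via video autoencoders",
    ],
    "efficient network architectures": [
        "sparse or windowed spatiotemporal attention",
        "token merging across frames (e.g., VidToMe-style)",
        "compressed long-context encoders (e.g., FlexFormer in LoViC)",
    ],
    "efficient inference algorithms": [
        "few-step sampling via distillation or consistency models",
        "real-time streaming rollout for interactive prompting (e.g., LongLive)",
    ],
}

for axis, examples in taxonomy.items():
    print(f"{axis}:")
    for ex in examples:
        print(f"  - {ex}")
```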

Load-bearing premise

Video generation models already contain enough capacity to simulate physical dynamics and causalities, so the main barrier is simply computational cost.

What would settle it

A controlled test in which an efficiency-optimized video model runs in real time yet produces large errors when predicting object trajectories or collisions over many future frames in a driving scene.
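As a rough illustration of that test (not something the paper provides), the sketch below measures both sides of the bargain: per-frame latency against a real-time budget, and long-horizon trajectory error against ground-truth object tracks. The model API, the `ground_truth_tracks` field, and the thresholds are hypothetical placeholders.

```python
# Minimal sketch of the falsification test described above (all names hypothetical):
# an efficiency-optimized video world model is "fast enough" if it meets a real-time
# frame budget, and it fails as a world model if its predicted object trajectories
# drift far from ground truth over a long horizon.
import time
import numpy as np

FRAME_BUDGET_S = 1.0 / 24.0   # real-time budget per generated frame (24 fps assumed)
ADE_THRESHOLD_M = 2.0         # illustrative tolerance on average displacement error

def evaluate(model, scene, horizon=240):
    """model.step(scene) -> (frame, predicted_object_positions); both assumed APIs."""
    latencies, pred_tracks = [], []
    for _ in range(horizon):
        t0 = time.perf_counter()
        frame, positions = model.step(scene)   # generated frame is unused by this check
        latencies.append(time.perf_counter() - t0)
        pred_tracks.append(positions)          # (num_objects, 2) array per frame

    pred = np.stack(pred_tracks)                      # (horizon, num_objects, 2)
    gt = scene.ground_truth_tracks[:horizon]          # assumed to exist in the benchmark
    ade = np.linalg.norm(pred - gt, axis=-1).mean()   # average displacement error (m)

    realtime = np.mean(latencies) <= FRAME_BUDGET_S
    faithful = ade <= ADE_THRESHOLD_M
    return {"realtime": realtime, "ade_m": float(ade), "faithful": faithful}

# The paper's premise would be undermined by a model reporting realtime=True
# while faithful=False over long horizons (e.g., 10 s of driving at 24 fps).
```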

Figures

Figures reproduced from arXiv: 2603.28489 by Hanzhong Guo, Junxiong Lin, Muyang He, Yizhou Yu.

Figure 1. A taxonomy of representative topics related to efficiency improvement for video generation-based world models.

Figure 2. Qualitative comparison of long-video generation. We evaluate six methods built upon Wan2.1-T2V-1.3B [38] across a 10-minute timeline. Red boxes are added to highlight specific failure modes for each method. Self Forcing [70]: inherently limited to a maximum generation length of 5 seconds (81 frames), leading to catastrophic structural collapse (e.g., at 40s), outfit inconsistencies (e.g., at 2m and 10m), a…

Figure 4. LoViC [101] introduces FlexFormer, a flexible encoder that com…

Figure 5. Comparison of inference time among full attention, SVG [107], …

Figure 6. LongLive [77] processes sequential user prompts and generates a…

Figure 7. VidToMe [24] first merges tokens locally and then combines the…

Figure 8. Examples of street-scene videos generated by MagicDrive-V2, which…

Figure 10. Comparison of videos generated by methods designed for real…
Original abstract

The rapid evolution of video generation has enabled models to simulate complex physical dynamics and long-horizon causalities, positioning them as potential world simulators. However, a critical gap still remains between the theoretical capacity for world simulation and the heavy computational costs of spatiotemporal modeling. To address this, we comprehensively and systematically review video generation frameworks and techniques that consider efficiency as a crucial requirement for practical world modeling. We introduce a novel taxonomy in three dimensions: efficient modeling paradigms, efficient network architectures, and efficient inference algorithms. We further show that bridging this efficiency gap directly empowers interactive applications such as autonomous driving, embodied AI, and game simulation. Finally, we identify emerging research frontiers in efficient video-based world modeling, arguing that efficiency is a fundamental prerequisite for evolving video generators into general-purpose, real-time, and robust world simulators.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper is a survey on video generation models positioned as world simulators. It reviews frameworks and techniques with a focus on efficiency, introducing a three-dimensional taxonomy covering efficient modeling paradigms, network architectures, and inference algorithms. The work argues that closing the computational gap will enable practical applications in autonomous driving, embodied AI, and game simulation, while identifying future research directions.

Significance. As a conceptual survey organizing existing literature on efficiency in video-based world modeling, the taxonomy could serve as a useful reference point for the community if it proves comprehensive. Highlighting efficiency as a prerequisite for real-time interactive simulators aligns with practical needs in the cited applications, though its impact depends on the accuracy and completeness of the categorization rather than new empirical contributions.

major comments (1)
  1. [Abstract] The central positioning that video generation models have already 'enabled' simulation of complex physical dynamics and long-horizon causalities lacks supporting citations to controlled benchmarks or quantitative evaluations of physical fidelity (e.g., momentum conservation, intervention-based state prediction, or deviation from Newtonian trajectories). This assumption is load-bearing for the claim that efficiency is the sole remaining barrier to reliable world simulation.
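For concreteness, here is a toy sketch of one such physical-fidelity probe: a momentum-conservation drift computed from object velocities estimated on a generated rollout. The tracker, the masses, and the numbers are assumptions for illustration, not anything the paper or the referee specifies.

```python
# Illustrative sketch (not from the paper) of a physical-fidelity check: how well is
# total linear momentum conserved across predicted object states in a rollout?
# Object masses and velocity estimates are assumed to come from an external tracker.
import numpy as np

def momentum_drift(velocities: np.ndarray, masses: np.ndarray) -> float:
    """velocities: (T, N, 2) per-frame object velocities; masses: (N,)."""
    # Total momentum of the tracked objects at each frame, p_t = sum_i m_i * v_{t,i}.
    p = (masses[None, :, None] * velocities).sum(axis=1)               # (T, 2)
    # Relative drift over the rollout; near zero for a closed, frictionless scene.
    drift = np.linalg.norm(p - p[0], axis=-1) / (np.linalg.norm(p[0]) + 1e-8)
    return float(drift.max())

# Toy example with synthetic numbers: two objects moving at constant velocity.
v = np.zeros((10, 2, 2))
v[:, 0] = [1.0, 0.0]
v[:, 1] = [-1.0, 0.0]
m = np.array([2.0, 1.0])
print(momentum_drift(v, m))   # 0.0 -> momentum perfectly conserved in this toy rollout
```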
minor comments (1)
  1. The three-dimensional taxonomy would benefit from an explicit summary table or diagram early in the manuscript to clarify the relationships between paradigms, architectures, and algorithms.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the thoughtful review and constructive comment on our survey. We have addressed the concern regarding the abstract's positioning by planning targeted revisions to include supporting citations while preserving the paper's core argument on efficiency.

Point-by-point responses
  1. Referee: [Abstract] The central positioning that video generation models have already 'enabled' simulation of complex physical dynamics and long-horizon causalities lacks supporting citations to controlled benchmarks or quantitative evaluations of physical fidelity (e.g., momentum conservation, intervention-based state prediction, or deviation from Newtonian trajectories). This assumption is load-bearing for the claim that efficiency is the sole remaining barrier to reliable world simulation.

    Authors: We acknowledge the validity of this observation. The abstract's phrasing draws from the reviewed literature demonstrating video models' capacity for physical simulation in specific settings, yet we agree that explicit citations to quantitative benchmarks on physical fidelity (e.g., conservation laws or intervention tests) would strengthen the claim and clarify the load-bearing assumption. In the revised manuscript, we will add references to relevant controlled evaluations from the surveyed works and adjust the abstract wording to note demonstrated capabilities alongside remaining limitations. This revision supports rather than undermines our central thesis that efficiency is the key remaining barrier to practical deployment, as even validated physical fidelity does not address computational scalability for real-time use. Revision: yes

Circularity Check

0 steps flagged

Survey paper with no derivation chain; claims rest on external literature review

Full rationale

This manuscript is a comprehensive review and taxonomy of existing video generation methods, introducing organizational categories (efficient paradigms, architectures, algorithms) without any new equations, fitted parameters, or first-principles derivations. No step claims a 'prediction' that reduces to a self-fit, self-citation chain, or renamed ansatz. The abstract and context explicitly frame the work as surveying published techniques, with efficiency positioned as an external engineering requirement rather than a quantity derived from the paper's own inputs. Central claims about world-simulation capacity are presented as motivation drawn from prior literature, not as internally validated results. This satisfies the self-contained benchmark: the paper's structure is independent of any circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a review paper that proposes an organizational taxonomy based on existing literature. No new free parameters, axioms, or invented entities are introduced.

pith-pipeline@v0.9.0 · 5443 in / 963 out tokens · 40687 ms · 2026-05-14T01:31:20.467750+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Reference graph

Works this paper leans on

254 extracted references · 254 canonical work pages · 20 internal anchors

  1. [1]

    Generative adversarial nets,

    I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” Advances in neural information processing systems, vol. 27, 2014

  2. [2]

    Temporal generative adversarial nets with singular value clipping,

    M. Saito, E. Matsumoto, and S. Saito, “Temporal generative adversarial nets with singular value clipping,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 2830–2839

  3. [3]

    Video pixel networks,

    N. Kalchbrenner, A. Oord, K. Simonyan, I. Danihelka, O. Vinyals, A. Graves, and K. Kavukcuoglu, “Video pixel networks,” in International Conference on Machine Learning. PMLR, 2017, pp. 1771–1779

  4. [4]

    VideoGPT: Video Generation using VQ-VAE and Transformers

    W. Yan, Y. Zhang, P. Abbeel, and A. Srinivas, “Videogpt: Video generation using vq-vae and transformers,” arXiv preprint arXiv:2104.10157, 2021

  5. [5]

    Zero-shot text-to-image generation,

    A. Ramesh, M. Pavlov, G. Goh, S. Gray, C. Voss, A. Radford, M. Chen, and I. Sutskever, “Zero-shot text-to-image generation,” in International conference on machine learning. PMLR, 2021, pp. 8821–8831

  6. [6]

    High-resolution image synthesis with latent diffusion models,

    R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “High-resolution image synthesis with latent diffusion models,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 10684–10695

  7. [7]

    Photorealistic text-to-image diffusion models with deep language understanding,

    C. Saharia, W. Chan, S. Saxena, L. Li, J. Whang, E. L. Denton, K. Ghasemipour, R. Gontijo Lopes, B. Karagol Ayan, T. Salimans et al., “Photorealistic text-to-image diffusion models with deep language understanding,” Advances in neural information processing systems, vol. 35, pp. 36479–36494, 2022

  8. [8]

    Video diffusion models,

    J. Ho, T. Salimans, A. Gritsenko, W. Chan, M. Norouzi, and D. J. Fleet, “Video diffusion models,” Advances in neural information processing systems, vol. 35, pp. 8633–8646, 2022

  9. [9]

    Score-based generative modeling through stochastic differential equations,

    Y. Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, and B. Poole, “Score-based generative modeling through stochastic differential equations,” in International Conference on Learning Representations, 2021. [Online]. Available: https://openreview.net/forum?id=PxTIG12RRHS

  10. [10]

    Align your latents: High-resolution video synthesis with latent diffusion models,

    A. Blattmann, R. Rombach, H. Ling, T. Dockhorn, S. W. Kim, S. Fidler, and K. Kreis, “Align your latents: High-resolution video synthesis with latent diffusion models,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2023, pp. 22563–22575

  11. [11]

    Improving image generation with bet- ter captions,

    J. Betker, G. Goh, L. Jing, T. Brooks, J. Wang, L. Li, L. Ouyang, J. Zhuang, J. Lee, Y . Guoet al., “Improving image generation with bet- ter captions,”Computer Science. https://cdn. openai. com/papers/dall- e-3. pdf, vol. 2, no. 3, p. 8, 2023

  12. [12]

    Sdxl: Improving latent diffusion models for high-resolution image synthesis,

    D. Podell, Z. English, K. Lacey, A. Blattmann, T. Dockhorn, J. M ¨uller, J. Penna, and R. Rombach, “Sdxl: Improving latent diffusion models for high-resolution image synthesis,” inThe Twelfth International Conference on Learning Representations, 2024

  13. [13]

    Scaling rectified flow transformers for high-resolution image synthesis,

    P. Esser, S. Kulal, A. Blattmann, R. Entezari, J. M ¨uller, H. Saini, Y . Levi, D. Lorenz, A. Sauer, F. Boeselet al., “Scaling rectified flow transformers for high-resolution image synthesis,” inForty-first International Conference on Machine Learning, 2024

  14. [14]

    Video generation models as world simulators,

    T. Brooks, B. Peebles, C. Holmes, W. DePue, Y . Guo, L. Jing, D. Schnurr, J. Taylor, T. Luhman, E. Luhmanet al., “Video generation models as world simulators,” 2024. [Online]. Available: https: //openai.com/research/video-generation-models-as-world-simulators

  15. [15]

    Scalable diffusion models with transformers,

    W. Peebles and S. Xie, “Scalable diffusion models with transformers,” inProceedings of the IEEE/CVF international conference on computer vision, 2023, pp. 4195–4205

  16. [16]

    World Models

    D. Ha and J. Schmidhuber, “World models,”arXiv preprint arXiv:1803.10122, vol. 2, no. 3, p. 440, 2018

  17. [17]

    Learning interactive real-world simulators,

    S. Yang, Y . Du, S. K. S. Ghasemipour, J. Tompson, L. P. Kaelbling, D. Schuurmans, and P. Abbeel, “Learning interactive real-world simulators,” inThe Twelfth International Conference on Learning Representations, 2024. [Online]. Available: https: //openreview.net/forum?id=sFyTZEqmUY

  18. [18]

    GAIA-1: A Generative World Model for Autonomous Driving

    A. Hu, L. Russell, H. Yeo, Z. Murez, G. Fedoseev, A. Kendall, J. Shotton, and G. Corrado, “Gaia-1: A generative world model for autonomous driving,”arXiv preprint arXiv:2309.17080, 2023

  19. [19]

    Drivedreamer: Towards real-world-drive world models for autonomous driving,

    X. Wang, Z. Zhu, G. Huang, X. Chen, J. Zhu, and J. Lu, “Drivedreamer: Towards real-world-drive world models for autonomous driving,” in European conference on computer vision. Springer, 2024, pp. 55–72

  20. [20]

    Learning universal policies via text-guided video generation,

    Y . Du, S. Yang, B. Dai, H. Dai, O. Nachum, J. Tenenbaum, D. Schu- urmans, and P. Abbeel, “Learning universal policies via text-guided video generation,”Advances in neural information processing systems, vol. 36, pp. 9156–9172, 2023

  21. [21]

    Google DeepMind. Genie 3. Google DeepMind. [Online]. Available: https://deepmind.google/models/genie/

  22. [22]

    Stable video infinity: Infinite-length video generation with error recycling,

    W. Li, W. Pan, P.-C. Luan, Y . Gao, and A. Alahi, “Stable video infinity: Infinite-length video generation with error recycling,” inThe Fourteenth International Conference on Learning Representations, 2026. [Online]. Available: https://openreview.net/forum?id=X96Ei9n34a

  23. [23]

    World- pack: Compressed memory improves spatial consistency in video world modeling,

    Y . Oshima, Y . Iwasawa, M. Suzuki, Y . Matsuo, and H. Furuta, “World- pack: Compressed memory improves spatial consistency in video world modeling,”arXiv preprint arXiv:2512.02473, 2025

  24. [24]

    Vidtome: Video token merging for zero-shot video editing,

    X. Li, C. Ma, X. Yang, and M.-H. Yang, “Vidtome: Video token merging for zero-shot video editing,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 7486–7495

  25. [25]

    Dc-videogen: Efficient video gen- eration with deep compression video autoencoder,

    J. Chen, W. He, Y . Gu, Y . Zhao, J. Yu, J. Chen, D. Zou, Y . Lin, Z. Zhang, M. Liet al., “Dc-videogen: Efficient video gen- eration with deep compression video autoencoder,”arXiv preprint arXiv:2509.25182, 2025

  26. [26]

    Denoising diffusion probabilistic models,

    J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic models,”Advances in neural information processing systems, vol. 33, pp. 6840–6851, 2020

  27. [27]

    Flow matching for generative modeling,

    Y . Lipman, R. T. Q. Chen, H. Ben-Hamu, M. Nickel, and M. Le, “Flow matching for generative modeling,” inThe Eleventh International Conference on Learning Representations, 2023. [Online]. Available: https://openreview.net/forum?id=PqvMRDCJT9t

  28. [28]

    Instaflow: One step is enough for high-quality diffusion-based text-to-image generation,

    X. Liu, X. Zhang, J. Ma, J. Peng, and qiang liu, “Instaflow: One step is enough for high-quality diffusion-based text-to-image generation,” inThe Twelfth International Conference on Learning Representations, 2024. [Online]. Available: https://openreview.net/ forum?id=1k4yZbbDqX

  29. [29]

    Taming transformers for high-resolution image synthesis,

    P. Esser, R. Rombach, and B. Ommer, “Taming transformers for high-resolution image synthesis,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 12 873–12 883

  30. [30]

    Latte: Latent diffusion transformer for video generation,

    X. Ma, Y . Wang, X. Chen, G. Jia, Z. Liu, Y .-F. Li, C. Chen, and Y . Qiao, “Latte: Latent diffusion transformer for video generation,”Transactions on Machine Learning Research, 2025. [Online]. Available: https://openreview.net/forum?id=ntGPYNUF3t

  31. [31]

    Magvit: Masked gen- erative video transformer,

    L. Yu, Y . Cheng, K. Sohn, J. Lezama, H. Zhang, H. Chang, A. G. Hauptmann, M.-H. Yang, Y . Hao, I. Essaet al., “Magvit: Masked gen- erative video transformer,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 10 459–10 469

  32. [32]

    Cogvideox: Text-to- video diffusion models with an expert transformer,

    Z. Yang, J. Teng, W. Zheng, M. Ding, S. Huang, J. Xu, Y . Yang, W. Hong, X. Zhang, G. Fenget al., “Cogvideox: Text-to- video diffusion models with an expert transformer,” inThe Thirteenth International Conference on Learning Representations, 2025. [Online]. Available: https://openreview.net/forum?id=LQzN6TRFg9

  33. [33]

    U-net: Convolutional net- works for biomedical image segmentation,

    O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional net- works for biomedical image segmentation,” inInternational Confer- ence on Medical image computing and computer-assisted intervention. Springer, 2015, pp. 234–241

  34. [34]

    Learning transferable visual models from natural language supervision,

    A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clarket al., “Learning transferable visual models from natural language supervision,” inInternational conference on machine learning. PmLR, 2021, pp. 8748–8763

  35. [35]

    Exploring the limits of transfer learning with a unified text-to-text transformer,

    C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y . Zhou, W. Li, and P. J. Liu, “Exploring the limits of transfer learning with a unified text-to-text transformer,”Journal of machine learning research, vol. 21, no. 140, pp. 1–67, 2020

  36. [36]

    HunyuanVideo: A Systematic Framework For Large Video Generative Models

    W. Kong, Q. Tian, Z. Zhang, R. Min, Z. Dai, J. Zhou, J. Xiong, X. Li, B. Wuet al., “Hunyuanvideo: A systematic framework for large video generative models,”arXiv preprint arXiv:2412.03603, 2024

  37. [37]

    Seedance 1.0: Exploring the Boundaries of Video Generation Models

    Y . Gao, H. Guo, T. Hoang, W. Huang, L. Jiang, F. Kong, H. Li, J. Li, L. Li, X. Liet al., “Seedance 1.0: Exploring the boundaries of video generation models,”arXiv preprint arXiv:2506.09113, 2025

  38. [38]

    Wan: Open and Advanced Large-Scale Video Generative Models

    T. Wan, A. Wang, B. Ai, B. Wen, C. Mao, C.-W. Xie, D. Chen, F. Yu, H. Zhao, J. Yang et al., “Wan: Open and advanced large-scale video generative models,” arXiv preprint arXiv:2503.20314, 2025

  39. [39]

    Hunyuanvideo 1.5 technical report.arXiv preprint arXiv:2511.18870, 2025a

    B. Wu, C. Zou, C. Li, D. Huang, F. Yang, H. Tan, J. Peng, J. Wu, J. Xiong, J. Jianget al., “Hunyuanvideo 1.5 technical report,”arXiv preprint arXiv:2511.18870, 2025

  40. [40]

    Magicdrive-v2: High-resolution long video generation for autonomous driving with adaptive control,

    R. Gao, K. Chen, B. Xiao, L. Hong, Z. Li, and Q. Xu, “Magicdrive-v2: High-resolution long video generation for autonomous driving with adaptive control,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 28135–28144

  41. [41]

    Genie: Generative interactive environments,

    J. Bruce, M. D. Dennis, A. Edwards, J. Parker-Holder, Y . Shi, E. Hughes, M. Lai, A. Mavalankar, R. Steigerwald, C. Appset al., “Genie: Generative interactive environments,” inForty-first Interna- tional Conference on Machine Learning, 2024

  42. [42]

    Matrix-game 2.0: An open-source real-time and streaming interactive world model

    X. He, C. Peng, Z. Liu, B. Wang, Y . Zhang, Q. Cui, F. Kang, B. Jiang et al., “Matrix-game 2.0: An open-source real-time and streaming interactive world model,”arXiv preprint arXiv:2508.13009, 2025

  43. [43]

    World Simulation with Video Foundation Models for Physical AI

    A. Ali, J. Bai, M. Bala, Y . Balaji, A. Blakeman, T. Cai, J. Cao, T. Cao, E. Cha, Y .-W. Chaoet al., “World simulation with video foundation models for physical ai,”arXiv preprint arXiv:2511.00062, 2025

  44. [44]

    Emo: Emote portrait alive generating expressive portrait videos with audio2video diffusion model under weak conditions,

    L. Tian, Q. Wang, B. Zhang, and L. Bo, “Emo: Emote portrait alive generating expressive portrait videos with audio2video diffusion model under weak conditions,” inEuropean Conference on Computer Vision. Springer, 2024, pp. 244–260

  45. [45]

    Ditto: Motion- space diffusion for controllable realtime talking head synthesis,

    T. Li, R. Zheng, M. Yang, J. Chen, and M. Yang, “Ditto: Motion- space diffusion for controllable realtime talking head synthesis,” in Proceedings of the 33rd ACM International Conference on Multimedia, 2025, pp. 9704–9713

  46. [46]

    Soulx-flashtalk: Real-time infinite streaming of audio-driven avatars via self-correcting bidirectional distillation,

    L. Shen, Q. Qiao, T. Yu, K. Zhou, T. Yu, Y . Zhan, Z. Wang, M. Tao, S. Yin, and S. Liu, “Soulx-flashtalk: Real-time infinite streaming of audio-driven avatars via self-correcting bidirectional distillation,”

  47. [47]

    Available: https://arxiv.org/abs/2512.23379

    [Online]. Available: https://arxiv.org/abs/2512.23379

  48. [48]

    Infinitetalk: Audio-driven video generation for sparse-frame video dubbing,

    S. Yang, Z. Kong, F. Gao, M. Cheng, X. Liu, Y . Zhang, Z. Kang, W. Luo, X. Cai, R. He, and X. Wei, “Infinitetalk: Audio-driven video generation for sparse-frame video dubbing,” 2025. [Online]. Available: https://arxiv.org/abs/2508.14033

  49. [49]

    MEMO: Memory-guided diffusion for expressive talking video generation,

    L. Zheng, Y . Zhang, H. Guo, J. Pan, Z. Tan, J. Lu, C. Tang, B. An, and S. Y AN, “MEMO: Memory-guided diffusion for expressive talking video generation,”Transactions on Machine Learning Research, 2026, j2C Certification. [Online]. Available: https://openreview.net/forum?id=uBcHcM7Kzi

  50. [50]

    Magicinfinite: Generating infinite talking videos with your words and voice,

    H. Yi, T. Ye, S. Shao, X. Yang, J. Zhao, H. Guo, T. Wang, Q. Yin, Z. Xie, L. Zhuet al., “Magicinfinite: Generating infinite talking videos with your words and voice,”arXiv preprint arXiv:2503.05978, 2025

  51. [51]

    ivideoGPT: Interactive videoGPTs are scalable world models,

    J. Wu, S. Yin, N. Feng, X. He, D. Li, J. HAO, and M. Long, “ivideoGPT: Interactive videoGPTs are scalable world models,” inThe Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. [Online]. Available: https://openreview.net/forum?id= 4TENzBftZR

  52. [52]

    Progressive distillation for fast sampling of diffusion models,

    T. Salimans and J. Ho, “Progressive distillation for fast sampling of diffusion models,” inInternational Conference on Learning Representations, 2022. [Online]. Available: https://openreview.net/ forum?id=TIdIXIpzhoI

  53. [53]

    One step diffusion via shortcut models,

    K. Frans, D. Hafner, S. Levine, and P. Abbeel, “One step diffusion via shortcut models,” inThe Thirteenth International Conference on Learning Representations, 2025. [Online]. Available: https://openreview.net/forum?id=OlzB6LnXcS

  54. [54]

    Gpd: Guided progressive dis- tillation for fast and high-quality video generation,

    X. Liang, Y . Zhang, and L. Zhu, “Gpd: Guided progressive dis- tillation for fast and high-quality video generation,”arXiv preprint arXiv:2602.01814, 2026

  55. [55]

    Consistency models,

    Y . Song, P. Dhariwal, M. Chen, and I. Sutskever, “Consistency models,” inProceedings of the 40th International Conference on Machine Learning, 2023, pp. 32 211–32 252

  56. [56]

    Latent Consistency Models: Synthesizing High-Resolution Images with Few-Step Inference

    S. Luo, Y . Tan, L. Huang, J. Li, and H. Zhao, “Latent consistency models: Synthesizing high-resolution images with few-step inference,” arXiv preprint arXiv:2310.04378, 2023

  57. [57]

    Videolcm: Video latent consistency model,

    X. Wang, S. Zhang, H. Zhang, Y . Liu, Y . Zhang, C. Gao, and N. Sang, “Videolcm: Video latent consistency model,”arXiv preprint arXiv:2312.09109, 2023

  58. [58]

    Animatelcm: Computation-efficient personalized style video generation without personalized video data,

    F.-Y . Wang, Z. Huang, W. Bian, X. Shi, K. Sun, G. Song, Y . Liu, and H. Li, “Animatelcm: Computation-efficient personalized style video generation without personalized video data,” inSIGGRAPH Asia 2024 Technical Communications, 2024, pp. 1–5

  59. [59]

    Turbodiffusion: Accelerating video diffusion models by 100-200 times,

    J. Zhang, K. Zheng, K. Jiang, H. Wang, I. Stoica, J. E. Gonzalez, J. Chen, and J. Zhu, “Turbodiffusion: Accelerating video diffusion models by 100-200 times,”arXiv preprint arXiv:2512.16093, 2025

  60. [60]

    Fastvideo,

    F. Team, “Fastvideo,” 2025, https://huggingface.co/FastVideo

  61. [61]

    One-step diffusion with distribution matching distillation,

    T. Yin, M. Gharbi, R. Zhang, E. Shechtman, F. Durand, W. T. Freeman, and T. Park, “One-step diffusion with distribution matching distillation,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2024, pp. 6613–6623

  62. [62]

    Improved distribution matching distillation for fast image synthesis,

    T. Yin, M. Gharbi, T. Park, R. Zhang, E. Shechtman, F. Durand, and B. Freeman, “Improved distribution matching distillation for fast image synthesis,”Advances in neural information processing systems, vol. 37, pp. 47 455–47 487, 2024

  63. [63]

    Accelerating video diffusion models via distribution matching,

    Y . Zhu, H. Yan, H. Yang, K. Zhang, and J. Li, “Accelerating video diffusion models via distribution matching,”arXiv preprint arXiv:2412.05899, 2024

  64. [64]

    Diffusion adversarial post-training for one-step video generation,

    S. Lin, X. Xia, Y . Ren, C. Yang, X. Xiao, and L. Jiang, “Diffusion adversarial post-training for one-step video generation,” in Forty-second International Conference on Machine Learning, 2025. [Online]. Available: https://openreview.net/forum?id=AAgzsnhc28

  65. [65]

    Videopoet: A large language model for zero-shot video generation,

    D. Kondratyuk, L. Yu, X. Gu, J. Lezama, J. Huang, G. Schindler, R. Hornung, V . Birodkar, J. Yan, M.-C. Chiuet al., “Videopoet: A large language model for zero-shot video generation,” inForty-first International Conference on Machine Learning, 2024. [Online]. Available: https://openreview.net/forum?id=LRkJwPIDuE

  66. [66]

    Loong: Generating minute-level long videos with autoregres- sive language models,

    Y . Wang, T. Xiong, D. Zhou, Z. Lin, Y . Zhao, B. Kang, J. Feng, and X. Liu, “Loong: Generating minute-level long videos with autoregres- sive language models,”arXiv preprint arXiv:2410.02757, 2024

  67. [67]

    Progressive autoregressive video diffusion models,

    D. Xie, Z. Xu, Y . Hong, H. Tan, D. Liu, F. Liu, A. Kaufman, and Y . Zhou, “Progressive autoregressive video diffusion models,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, June 2025, pp. 6376–6386

  68. [68]

    Frame context packing and drift prevention in next-frame-prediction video diffusion models,

    L. Zhang, S. Cai, M. Li, G. Wetzstein, and M. Agrawala, “Frame context packing and drift prevention in next-frame-prediction video diffusion models,” inThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. [Online]. Available: https://openreview.net/forum?id=J8JCF64aEn

  69. [69]

    From slow bidirectional to fast autoregressive video diffusion models,

    T. Yin, Q. Zhang, R. Zhang, W. T. Freeman, F. Durand, E. Shechtman, and X. Huang, “From slow bidirectional to fast autoregressive video diffusion models,” inProceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 22 963–22 974

  70. [70]

    Diffusion forcing: Next-token prediction meets full-sequence diffusion,

    B. Chen, D. M. Monsó, Y. Du, M. Simchowitz, R. Tedrake, and V. Sitzmann, “Diffusion forcing: Next-token prediction meets full-sequence diffusion,” in The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. [Online]. Available: https://openreview.net/forum?id=yDo1ynArjj

  71. [71]

    Self forcing: Bridging the train-test gap in autoregressive video diffusion,

    X. Huang, Z. Li, G. He, M. Zhou, and E. Shechtman, “Self forcing: Bridging the train-test gap in autoregressive video diffusion,” in The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. [Online]. Available: https://openreview.net/forum?id=mSiN7i0BYH

  72. [72]

    Rolling forcing: Autoregressive long video diffusion in real time,

    K. Liu, W. Hu, J. Xu, Y . Shan, and S. Lu, “Rolling forcing: Autoregressive long video diffusion in real time,” inThe Fourteenth International Conference on Learning Representations, 2026. [Online]. Available: https://openreview.net/forum?id=IAyzXjbfwo

  73. [73]

    Self-forcing++: Towards minute-scale high-quality video generation,

    J. Cui, J. Wu, M. Li, T. Yang, X. Li, R. Wang, A. Bai, Y . Ban, and C.-J. Hsieh, “Self-forcing++: Towards minute-scale high-quality video generation,” inThe Fourteenth International Conference on Learning Representations, 2026. [Online]. Available: https://openreview.net/forum?id=DzvPiqh23f

  74. [74]

    Reward forcing: Efficient streaming video generation with rewarded distribution matching distillation,

    Y . Lu, Y . Zeng, H. Li, H. Ouyang, Q. Wang, K. L. Cheng, J. Zhu, H. Cao, Z. Zhang, X. Zhuet al., “Reward forcing: Efficient streaming video generation with rewarded distribution matching distillation,” arXiv preprint arXiv:2512.04678, 2025

  75. [75]

    Causal forcing: Autoregres- sive diffusion distillation done right for high-quality real-time interactive video generation.arXiv preprint arXiv:2602.02214, 2026

    H. Zhu, M. Zhao, G. He, H. Su, C. Li, and J. Zhu, “Causal forcing: Autoregressive diffusion distillation done right for high-quality real- time interactive video generation,”arXiv preprint arXiv:2602.02214, 2026

  76. [76]

    Rolling Sink: Bridging Limited-Horizon Training and Open-Ended Testing in Autoregressive Video Diffusion

    H. Li, S. Liu, Z. Lin, and M. Chandraker, “Rolling sink: Bridging limited-horizon training and open-ended testing in autoregressive video diffusion,”arXiv preprint arXiv:2602.07775, 2026. [Online]. Available: https://arxiv.org/abs/2602.07775

  77. [77]

    Autoregressive adversarial post-training for real-time interactive video generation,

    S. Lin, C. Yang, H. He, J. Jiang, Y. Ren, X. Xia, Y. Zhao, X. Xiao, and L. Jiang, “Autoregressive adversarial post-training for real-time interactive video generation,” in The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. [Online]. Available: https://openreview.net/forum?id=lF6SHARvmG

  78. [78]

    Longlive: Real-time interactive long video generation,

    S. Yang, W. Huang, R. Chu, Y. Xiao, Y. Zhao, X. Wang, M. Li, E. Xie, Y.-C. Chen, Y. Lu et al., “Longlive: Real-time interactive long video generation,” in The Fourteenth International Conference on Learning Representations, 2026. [Online]. Available: https://openreview.net/forum?id=nCAODkpsPJ

  79. [79]

    Deep forc- ing: Training-free long video generation with deep sink and participative compression.arXiv preprint arXiv:2512.05081, 2025

    J. Yi, W. Jang, P. H. Cho, J. Nam, H. Yoon, and S. Kim, “Deep forcing: Training-free long video generation with deep sink and participative compression,”arXiv preprint arXiv:2512.05081, 2025

  80. [80]

    Pyramidal flow matching for efficient video generative modeling,

    Y . Jin, Z. Sun, N. Li, K. Xu, K. Xu, H. Jiang, N. Zhuang, Q. Huang, Y . Song, Y . MU, and Z. Lin, “Pyramidal flow matching for efficient video generative modeling,” inThe Thirteenth International Conference on Learning Representations, 2025. [Online]. Available: https://openreview.net/forum?id=66NzcRQuOq

Showing first 80 references.