Causal Forcing: Autoregressive Diffusion Distillation Done Right for High-Quality Real-Time Interactive Video Generation
Pith reviewed 2026-05-22 11:42 UTC · model grok-4.3
The pith
Causal Forcing uses an autoregressive teacher for ODE initialization to recover the teacher's flow map when distilling into causal video models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Causal Forcing initializes the autoregressive student via ODE distillation from an autoregressive teacher. This satisfies the frame-level injectivity condition that bidirectional teachers violate, so the student recovers the teacher's flow map instead of converging to a conditional-expectation solution. The subsequent DMD stage then produces superior few-step causal video generators.
What carries the argument
Causal Forcing replaces the bidirectional teacher with an autoregressive teacher solely for the ODE initialization step, enforcing injectivity before DMD is applied.
If this is right
- Autoregressive video generators distilled this way outperform prior Self Forcing baselines on dynamic degree, vision reward, and instruction following.
- Causal attention can replace full attention in the student without the performance penalty previously observed.
- Real-time interactive video generation becomes viable at higher visual and temporal fidelity.
- The same two-stage recipe (AR-teacher ODE init followed by DMD) applies to other diffusion-based sequence models.
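The difference between full and causal attention that the bullets above rely on can be sketched as an attention mask over frame tokens. This is a minimal illustration, not the paper's implementation: the function name `frame_attention_mask`, the per-frame token layout, and frame-level (rather than chunk-level) causality are all assumptions made for the example.

```python
import numpy as np

def frame_attention_mask(num_frames: int, tokens_per_frame: int, causal: bool) -> np.ndarray:
    """Return a boolean (N, N) mask with N = num_frames * tokens_per_frame.

    mask[i, j] is True when query token i may attend to key token j.
    Hypothetical helper for illustration only.
    """
    # Frame index of every token, e.g. [0, 0, 1, 1, 2, 2] for 3 frames x 2 tokens.
    frame_ids = np.repeat(np.arange(num_frames), tokens_per_frame)
    n = frame_ids.size
    if not causal:
        # Full (bidirectional) attention: every token sees every frame.
        return np.ones((n, n), dtype=bool)
    # Causal attention: a token sees its own frame and all earlier frames.
    return frame_ids[:, None] >= frame_ids[None, :]

full_mask = frame_attention_mask(3, 2, causal=False)
causal_mask = frame_attention_mask(3, 2, causal=True)
```

With the causal mask, tokens of the first frame never see later frames, which is the property the injectivity argument attributes to the autoregressive teacher.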
Where Pith is reading between the lines
- The technique may extend to distilling other teacher-student pairs that differ in causality or attention scope.
- Longer video sequences or higher frame rates could be tested to check whether the injectivity benefit persists.
- Interactive applications such as live editing or simulation may see reduced latency once the distilled models run at full speed.
- Combining Causal Forcing with additional compression steps could push generation toward sub-frame latency.
Load-bearing premise
An autoregressive teacher produces frame-level injectivity under the PF-ODE so that the flow map can be recovered.
What would settle it
The central claim would be falsified by a controlled comparison in which, with the downstream DMD stage held fixed, initializing from an autoregressive teacher yields an AR student of equal or worse final quality than initializing from a bidirectional teacher.
Original abstract
To achieve real-time interactive video generation, current methods distill pretrained bidirectional video diffusion models into few-step autoregressive (AR) models, facing an architectural gap when full attention is replaced by causal attention. However, existing approaches do not bridge this gap theoretically. They initialize the AR student via ODE distillation, which requires frame-level injectivity, where each noisy frame must map to a unique clean frame under the PF-ODE of an AR teacher. Distilling an AR student from a bidirectional teacher violates this condition, preventing recovery of the teacher's flow map and instead inducing a conditional-expectation solution, which degrades performance. To address this issue, we propose Causal Forcing, which uses an autoregressive teacher for ODE initialization to bridge the architectural gap, and then applies the same DMD procedure as in Self Forcing. Empirical results show that our method outperforms all baselines across all metrics, surpassing the SOTA Self Forcing by 19.3% in Dynamic Degree, 8.7% in VisionReward, and 16.7% in Instruction Following. Project page: https://thu-ml.github.io/CausalForcing.github.io/. Code: https://github.com/thu-ml/Causal-Forcing.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Causal Forcing to distill pretrained bidirectional video diffusion models into few-step autoregressive models for real-time interactive video generation. It identifies an architectural gap arising from replacing full attention with causal attention and proposes using an autoregressive teacher for ODE initialization to satisfy frame-level injectivity under the PF-ODE (allowing recovery of the teacher's flow map rather than a conditional-expectation solution), followed by the DMD procedure from Self Forcing. Empirical results claim outperformance over all baselines, including gains of 19.3% in Dynamic Degree, 8.7% in VisionReward, and 16.7% in Instruction Following relative to Self Forcing.
Significance. If the frame-level injectivity assumption holds, the work supplies a mechanistically motivated fix for the bidirectional-to-autoregressive distillation gap and reports concrete metric improvements in dynamic content and instruction adherence. The public release of code and a project page is a clear strength for reproducibility and follow-up work.
major comments (1)
- Abstract: the central mechanistic claim is that 'frame-level injectivity' holds under the PF-ODE for an autoregressive teacher (due to causal attention) but is violated by bidirectional teachers (due to future context). No formal argument, injectivity proof, or numerical verification (e.g., checking uniqueness of the noisy-to-clean mapping on held-out frames) is supplied. This assumption is load-bearing for attributing the reported gains to flow-map recovery rather than to other factors such as training schedule or initialization details.
minor comments (1)
- The abstract reports specific percentage improvements but does not indicate whether they are averaged over multiple seeds or include error bars; adding this information would strengthen the empirical section.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive review. We appreciate the recognition of the mechanistic motivation and the value of our public code release. We address the major comment below.
-
Referee: Abstract: the central mechanistic claim is that 'frame-level injectivity' holds under the PF-ODE for an autoregressive teacher (due to causal attention) but is violated by bidirectional teachers (due to future context). No formal argument, injectivity proof, or numerical verification (e.g., checking uniqueness of the noisy-to-clean mapping on held-out frames) is supplied. This assumption is load-bearing for attributing the reported gains to flow-map recovery rather than to other factors such as training schedule or initialization details.
Authors: We agree that a more explicit justification would strengthen the manuscript. The core intuition, as stated in the paper, is that causal attention in the autoregressive teacher restricts the PF-ODE evolution of frame t to depend only on frames 1 through t. This per-frame conditioning makes the mapping from a noisy frame to its clean counterpart unique under the teacher's flow, satisfying frame-level injectivity. Bidirectional attention, by contrast, allows future-frame information to influence the ODE trajectory of earlier frames, rendering the per-frame mapping non-injective and yielding a conditional-expectation solution instead of the teacher's flow map. While the initial submission relied on this architectural reasoning without a formal injectivity proof or additional numerical checks, we will add a dedicated paragraph in Section 3 together with a simple numerical verification on a low-dimensional toy diffusion model to confirm uniqueness of the noisy-to-clean mapping for causal versus bidirectional teachers. We believe these additions will better isolate the contribution of the AR-teacher initialization from other training factors; the existing ablations already show that replacing the bidirectional teacher with an autoregressive one yields the reported gains even under matched schedules. revision: partial
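The kind of toy check the rebuttal promises can be sketched in a few lines. This is not the authors' experiment: the scalar "frames", the linear student, and the names `fit_student` and `residual_mse` are assumptions chosen so the effect is easy to see. A student that observes only `x` is regressed onto a target that either depends on `x` alone (the injective case) or also on a hidden variable `z`, standing in for future-frame context the student cannot see.

```python
import numpy as np

def fit_student(x: np.ndarray, y: np.ndarray) -> tuple[float, float]:
    """Least-squares linear student y ~ a * x + b that sees only x (hypothetical helper)."""
    a, b = np.polyfit(x, y, deg=1)  # coefficients are returned highest degree first
    return float(a), float(b)

def residual_mse(x: np.ndarray, y: np.ndarray, a: float, b: float) -> float:
    """Mean squared error between the student's prediction and the target."""
    return float(np.mean((a * x + b - y) ** 2))

rng = np.random.default_rng(0)
n = 20_000
x = rng.normal(size=n)                 # what the student observes
z = rng.choice([-1.0, 1.0], size=n)    # context hidden from the student

y_injective = 2.0 * x                  # target determined by x alone
y_ambiguous = 2.0 * x + z              # target also depends on hidden z

a_inj, b_inj = fit_student(x, y_injective)
a_amb, b_amb = fit_student(x, y_ambiguous)

mse_inj = residual_mse(x, y_injective, a_inj, b_inj)  # essentially zero
mse_amb = residual_mse(x, y_ambiguous, a_amb, b_amb)  # close to Var(z) = 1
```

In the ambiguous case the fitted student is close to `2 * x`, the conditional mean of the target given `x`, which matches neither of the two possible targets `2 * x - 1` and `2 * x + 1`. Its residual stays near the variance of `z` no matter how long it trains, whereas the injective case is fit exactly.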
Circularity Check
Minor self-citation to prior DMD procedure; core AR-teacher initialization is independent
The paper re-uses the DMD procedure from Self Forcing but introduces a distinct initialization step that relies on the stated frame-level injectivity property of autoregressive teachers under the PF-ODE. This assumption is presented as a direct consequence of causal attention lacking future context, separate from any fitted parameters or self-referential definitions within the current work. No equation or derivation reduces by construction to prior outputs of the same run, and the empirical comparisons are reported as external validation. The self-citation is not load-bearing for the novel contribution and does not trigger higher circularity.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Frame-level injectivity is required for ODE distillation to recover the teacher's flow map rather than a conditional-expectation solution.
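A short derivation makes the axiom concrete. It is a sketch under stated assumptions: the symbols $\phi$ (the teacher's flow map) and the superscript $u$ (a frame or chunk index) follow the appendix fragment quoted in the reference graph on this page, while the student $G$, the squared-error regression objective, and the abbreviation of the student's conditioning to $(x_t^u, t)$ are assumptions made for illustration. It uses only the standard fact that the minimizer of a squared-error regression is the conditional mean.

```latex
\[
G^\star(x_t^u, t)
  = \arg\min_{G}\; \mathbb{E}\,\bigl\| G(x_t^u, t) - \phi(x_t, t)^u \bigr\|^2
  = \mathbb{E}\bigl[\phi(x_t, t)^u \mid x_t^u, t\bigr],
\]
\[
\mathbb{E}\,\bigl\| G^\star(x_t^u, t) - \phi(x_t, t)^u \bigr\|^2
  = \mathbb{E}\Bigl[\operatorname{tr}\operatorname{Var}\bigl(\phi(x_t, t)^u \mid x_t^u, t\bigr)\Bigr].
\]
```

Under these assumptions the optimal student coincides with the teacher's flow map exactly when the conditional variance vanishes almost surely, which is what frame-level injectivity provides. If the variance is positive with positive probability, the student can do no better than the conditional expectation.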
Forward citations
Cited by 22 Pith papers
-
Q-ARVD: Quantizing Autoregressive Video Diffusion Models
Q-ARVD introduces final-quality-aware frame weighting and outlier-aware adaptive dual-scale quantization to enable accurate low-bit inference for autoregressive video diffusion models.
-
Goodbye Drift: Anchored Tree Sampling for Long-Horizon Video-to-Video Generation
Anchored Tree Sampling converts horizon-compounding drift into anchor-bounded drift by organizing video generation as a sparse-to-dense tree of imputations instead of left-to-right autoregressive rollout.
-
LongLive-2.0: An NVFP4 Parallel Infrastructure for Long Video Generation
LongLive-2.0 delivers an NVFP4 parallel infrastructure that enables direct training of long multi-shot autoregressive diffusion video models and achieves up to 2.15x training and 1.84x inference speedups on Blackwell ...
-
CausalCine: Real-Time Autoregressive Generation for Multi-Shot Video Narratives
CausalCine enables real-time causal autoregressive multi-shot video generation via multi-shot training, content-aware memory routing for coherence, and distillation to few-step inference.
-
HorizonDrive: Self-Corrective Autoregressive World Model for Long-horizon Driving Simulation
HorizonDrive enables stable long-horizon autoregressive driving simulation via anti-drifting teacher training with scheduled rollout recovery and teacher rollout distillation.
-
MultiWorld: Scalable Multi-Agent Multi-View Video World Models
MultiWorld is a scalable framework for multi-agent multi-view video world models that improves controllability and consistency over single-agent baselines in game and robot tasks.
-
Efficient Video Diffusion Models: Advancements and Challenges
A survey that groups efficient video diffusion methods into four paradigms—step distillation, efficient attention, model compression, and cache/trajectory optimization—and outlines open challenges for practical use.
-
WorldKV: Efficient World Memory with World Retrieval and Compression
WorldKV enables persistent world memory in autoregressive video diffusion models by selectively retrieving and compressing KV-cache chunks, matching full-cache fidelity at roughly twice the throughput without training.
-
DySink: Dynamic Frame Sinks for Autoregressive Long Video Generation
DySink uses adaptive retrieval of relevant historical frames plus a sink anomaly gate to improve dynamic degree and temporal quality in minute-long autoregressive video generation.
-
Xiaomi EV World Model: A Joint World Model Integrating Reconstruction and Generation for Autonomous Driving
Xiaomi EV World Model integrates WorldRec for sparse-query 3D Gaussian reconstruction and WorldGen for fast causal video generation via bidirectional pretraining and causal fine-tuning to support autonomous driving si...
-
FashionChameleon: Towards Real-Time and Interactive Human-Garment Video Customization
FashionChameleon achieves interactive multi-garment video customization in real time by training a teacher model with in-context learning on single-garment pairs, applying streaming distillation, and using training-fr...
-
PhyMotion: Structured 3D Motion Reward for Physics-Grounded Human Video Generation
PhyMotion scores generated human videos by grounding recovered 3D poses in a physics simulator across kinematic, contact, and dynamic axes, yielding stronger human correlation and larger RL post-training gains than pr...
-
Pyramid Forcing: Head-Aware Pyramid KV Cache Policy for High-Quality Long Video Generation
Pyramid Forcing classifies attention heads into Anchor, Wave, and Veil types and applies type-specific KV cache policies to improve long-horizon autoregressive video generation quality.
-
Forcing-KV: Hybrid KV Cache Compression for Efficient Autoregressive Video Diffusion Models
Forcing-KV applies head-specific static and dynamic pruning to KV caches in AR video diffusion models, achieving over 29 fps, 30% memory reduction, and up to 2.82x speedup at maintained quality.
-
Human Cognition in Machines: A Unified Perspective of World Models
The paper introduces a unified framework for world models that fully incorporates all cognitive functions from Cognitive Architecture Theory, highlights under-researched areas in motivation and meta-cognition, and pro...
-
Long-Horizon Streaming Video Generation via Hybrid Attention with Decoupled Distillation
Hybrid Forcing combines linear temporal attention for long-range retention, block-sparse attention for efficiency, and decoupled distillation to achieve real-time unbounded 832x480 streaming video generation at 29.5 FPS.
-
Salt: Self-Consistent Distribution Matching with Cache-Aware Training for Fast Video Generation
Salt improves low-step video generation quality by adding endpoint-consistent regularization to distribution matching distillation and using cache-conditioned feature alignment for autoregressive models.
-
Video Generation Models as World Models: Efficient Paradigms, Architectures and Algorithms
Video generation models can function as world simulators if efficiency gaps in spatiotemporal modeling are bridged via organized paradigms, architectures, and algorithms.
-
One Sentence, One Drama: Personalized Short-Form Drama Generation via Multi-Agent Systems
A hierarchical multi-agent framework converts a single sentence into a short drama using debate-based scripting, 3D-grounded first frames for spatial consistency, and multi-stage reviewer loops.
-
Focused Forcing: Content-Aware Per-Frame KV Selection for Efficient Autoregressive Video Diffusion
Focused Forcing is a training-free per-frame KV selection method that combines attention scores with diversity metrics and head-importance estimation to accelerate autoregressive video diffusion up to 1.48x while impr...
-
A Systematic Post-Train Framework for Video Generation
A post-training pipeline for video generation models combines SFT, RLHF with novel GRPO, prompt enhancement, and inference optimization to improve visual quality, temporal coherence, and instruction following.
-
Matrix-Game 3.0: Real-Time and Streaming Interactive World Model with Long-Horizon Memory
Matrix-Game 3.0 delivers 720p real-time video generation at 40 FPS with minute-scale memory consistency by combining residual self-correction training, camera-aware memory injection, and DMD-based autoregressive disti...
Reference graph
Works this paper leans on
-
[1]
Bao, F., Xiang, C., Yue, G., He, G., Zhu, H., Zheng, K., Zhao, M., Liu, S., Wang, Y., and Zhu, J. Vidu: a highly consistent, dynamic and skilled text-to-video generator with diffusion models. arXiv preprint arXiv:2405.04233.
-
[2]
Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets
Blattmann, A., Dockhorn, T., Kulal, S., Mendelevitch, D., Kilian, M., Lorenz, D., Levi, Y., English, Z., Voleti, V., Letts, A., et al. Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127, 2023a.
Blattmann, A., Rombach, R., Ling, H., Dockhorn, T., Kim, S. W., Fidler, S., and Kreis, K. Align you...
-
[3]
SkyReels-V2: Infinite-length Film Generative Model
Chen, G., Lin, D., Yang, J., Lin, C., Zhu, J., Fan, M., Zhang, H., Chen, S., Chen, Z., Ma, C., et al. Skyreels-v2: Infinite-length film generative model. arXiv preprint arXiv:2504.13074.
-
[4]
VideoCrafter1: Open Diffusion Models for High-Quality Video Generation
Chen, H., Xia, M., He, Y., Zhang, Y., Cun, X., Yang, S., Xing, J., Liu, Y., Chen, Q., Wang, X., et al. Videocrafter1: Open diffusion models for high-quality video generation. arXiv preprint arXiv:2310.19512.
-
[5]
Self-Forcing++: Towards Minute-Scale High-Quality Video Generation
Cui, J., Wu, J., Li, M., Yang, T., Li, X., Wang, R., Bai, A., Ban, Y., and Hsieh, C.-J. Self-forcing++: Towards minute-scale high-quality video generation. arXiv preprint arXiv:2510.02283.
-
[6]
Autoregressive Video Generation without Vector Quantization
Deng, H., Pan, T., Diao, H., Luo, Z., Cui, Y., Lu, H., Shan, S., Qi, Y., and Wang, X. Autoregressive video generation without vector quantization. arXiv preprint arXiv:2412.14169.
-
[7]
Vidarc: Embodied video diffusion model for closed-loop control
Feng, Y., Xiang, C., Mao, X., Tan, H., Zhang, Z., Huang, S., Zheng, K., Liu, H., Su, H., and Zhu, J. Vidarc: Embodied video diffusion model for closed-loop control. arXiv preprint arXiv:2512.17661.
- [8]
-
[9]
Mean Flows for One-step Generative Modeling
Geng, Z., Deng, M., Bai, X., Kolter, J. Z., and He, K. Mean flows for one-step generative modeling. arXiv preprint arXiv:2505.13447.
-
[10]
Guo, Y., Yang, C., He, H., Zhao, Y., Wei, M., Yang, Z., Huang, W., and Lin, D. End-to-end training for autoregressive video diffusion via self-resampling. arXiv preprint arXiv:2512.15702.
-
[11]
LTX-Video: Realtime Video Latent Diffusion
HaCohen, Y., Chiprut, N., Brazowski, B., Shalem, D., Moshe, D., Richardson, E., Levin, E., Shiran, G., Zabari, N., Gordon, O., et al. Ltx-video: Realtime video latent diffusion. arXiv preprint arXiv:2501.00103.
-
[12]
Latent Video Diffusion Models for High-Fidelity Long Video Generation
He, Y., Yang, T., Zhang, Y., Shan, Y., and Chen, Q. Latent video diffusion models for high-fidelity long video generation. arXiv preprint arXiv:2211.13221.
-
[13]
Imagen Video: High Definition Video Generation with Diffusion Models
Ho, J., Chan, W., Saharia, C., Whang, J., Gao, R., Gritsenko, A., Kingma, D. P., Poole, B., Norouzi, M., Fleet, D. J., et al. Imagen video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303.
-
[14]
CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers
Hong, W., Ding, M., Zheng, W., Liu, X., and Tang, J. Cogvideo: Large-scale pretraining for text-to-video generation via transformers. arXiv preprint arXiv:2205.15868.
-
[15]
Relic: Interactive video world model with long-horizon memory
Hong, Y., Mei, Y., Ge, C., Xu, Y., Zhou, Y., Bi, S., Hold-Geoffroy, Y., Roberts, M., Fisher, M., Shechtman, E., et al. Relic: Interactive video world model with long-horizon memory. arXiv preprint arXiv:2512.04040.
-
[16]
Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion
Huang, X., Li, Z., He, G., Zhou, M., and Shechtman, E. Self forcing: Bridging the train-test gap in autoregressive video diffusion. arXiv preprint arXiv:2506.08009, 2025a.
Huang, Y., Guo, H., Wu, F., Zhang, S., Huang, S., Gan, Q., Liu, L., Zhao, S., Chen, E., Liu, J., et al. Live avatar: Streaming real-time audio-driven avatar generation with infinite len...
-
[17]
Pyramidal flow matching for efficient video generative modeling
Jin, Y., Sun, Z., Li, N., Xu, K., Jiang, H., Zhuang, N., Huang, Q., Song, Y., Mu, Y., and Lin, Z. Pyramidal flow matching for efficient video generative modeling. arXiv preprint arXiv:2410.05954.
- [18]
-
[19]
VideoPoet: A Large Language Model for Zero-Shot Video Generation
Kondratyuk, D., Yu, L., Gu, X., Lezama, J., Huang, J., Schindler, G., Hornung, R., Birodkar, V., Yan, J., Chiu, M.-C., et al. Videopoet: A large language model for zero-shot video generation. arXiv preprint arXiv:2312.14125.
-
[20]
HunyuanVideo: A Systematic Framework For Large Video Generative Models
Kong, W., Tian, Q., Zhang, Z., Min, R., Dai, Z., Zhou, J., Xiong, J., Li, X., Wu, B., Zhang, J., et al. Hunyuanvideo: A systematic framework for large video generative models. arXiv preprint arXiv:2412.03603.
-
[21]
Open-Sora Plan: Open-Source Large Video Generation Model
Lin, B., Ge, Y., Cheng, X., Li, Z., Zhu, B., Wang, S., He, X., Ye, Y., Yuan, S., Chen, L., et al. Open-sora plan: Open-source large video generation model. arXiv preprint arXiv:2412.00131.
-
[22]
Flow Matching for Generative Modeling
Lipman, Y., Chen, R. T., Ben-Hamu, H., Nickel, M., and Le, M. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747.
-
[23]
Rolling Forcing: Autoregressive Long Video Diffusion in Real Time
Liu, K., Hu, W., Xu, J., Shan, Y., and Lu, S. Rolling forcing: Autoregressive long video diffusion in real time. arXiv preprint arXiv:2509.25161.
-
[24]
Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow
Liu, X., Gong, C., and Liu, Q. Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003.
-
[25]
Simplifying, Stabilizing and Scaling Continuous-Time Consistency Models
Lu, C. and Song, Y. Simplifying, stabilizing and scaling continuous-time consistency models. arXiv preprint arXiv:2410.11081.
-
[26]
Latent Consistency Models: Synthesizing High-Resolution Images with Few-Step Inference
Luo, S., Tan, Y., Huang, L., Li, J., and Zhao, H. Latent consistency models: Synthesizing high-resolution images with few-step inference. arXiv preprint arXiv:2310.04378, 2023a.
Luo, W., Hu, T., Zhang, S., Sun, J., Li, Z., and Zhang, Z. Diff-instruct: A universal approach for transferring knowledge from pre-trained diffusion models. Advances in Neural In...
-
[27]
Yume-1.5: A text-controlled interactive world generation model
Mao, X., Li, Z., Li, C., Xu, X., Ying, K., He, T., Pang, J., Qiao, Y., and Zhang, K. Yume-1.5: A text-controlled interactive world generation model. arXiv preprint arXiv:2512.22096.
-
[28]
Po, R., Chan, E. R., Chen, C., and Wetzstein, G. Bagger: Backwards aggregation for mitigating drift in autoregressive video diffusion models. arXiv preprint arXiv:2512.12080.
-
[29]
Movie Gen: A Cast of Media Foundation Models
Polyak, A., Zohar, A., Brown, A., Tjandra, A., Sinha, A., Lee, A., Vyas, A., Shi, B., Ma, C.-Y., Chuang, C.-Y., et al. Movie gen: A cast of media foundation models. arXiv preprint arXiv:2410.13720.
-
[30]
Progressive Distillation for Fast Sampling of Diffusion Models
Salimans, T. and Ho, J. Progressive distillation for fast sampling of diffusion models. arXiv preprint arXiv:2202.00512.
-
[31]
Shin, J., Li, Z., Zhang, R., Zhu, J.-Y., Park, J., Shechtman, E., and Huang, X. Motionstream: Real-time video generation with interactive motion controls. arXiv preprint arXiv:2511.01266.
-
[32]
Make-A-Video: Text-to-Video Generation without Text-Video Data
Singer, U., Polyak, A., Hayes, T., Yin, X., An, J., Zhang, S., Hu, Q., Yang, H., Ashual, O., Gafni, O., et al. Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:2209.14792.
-
[33]
History-Guided Video Diffusion
Song, K., Chen, B., Simchowitz, M., Du, Y., Tedrake, R., and Sitzmann, V. History-guided video diffusion. arXiv preprint arXiv:2502.06764.
-
[34]
Improved Techniques for Training Consistency Models
Song, Y. and Dhariwal, P. Improved techniques for training consistency models. arXiv preprint arXiv:2310.14189.
-
[35]
Score-Based Generative Modeling through Stochastic Differential Equations
Song, Y., Sohl-Dickstein, J., Kingma, D. P., Kumar, A., Ermon, S., and Poole, B. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456.
-
[36]
WorldPlay: Towards Long-Term Geometric Consistency for Real-Time Interactive World Modeling
Sun, W., Zhang, H., Wang, H., Wu, J., Wang, Z., Wang, Z., Wang, Y., Zhang, J., Wang, T., and Guo, C. Worldplay: Towards long-term geometric consistency for real-time interactive world modeling. arXiv preprint arXiv:2512.14614, 2025a.
Sun, Z., Peng, Z., Ma, Y., Chen, Y., Zhou, Z., Zhou, Z., Zhang, G., Zhang, Y., Zhou, Y., Lu, Q., et al. Streamavata...
-
[37]
MAGI-1: Autoregressive Video Generation at Scale
Teng, H., Jia, H., Sun, L., Li, L., Li, M., Tang, M., Han, S., Zhang, T., Zhang, W., Luo, W., et al. Magi-1: Autoregressive video generation at scale. arXiv preprint arXiv:2505.13211.
-
[38]
Wan: Open and Advanced Large-Scale Video Generative Models
Wan, T., Wang, A., Ai, B., Wen, B., Mao, C., Xie, C.-W., Chen, D., Yu, F., Zhao, H., Yang, J., et al. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314.
-
[39]
Scaling autoregressive video models
Weissenborn, D., Täckström, O., and Uszkoreit, J. Scaling autoregressive video models. arXiv preprint arXiv:1906.02634.
-
[40]
Godiva: Generating open-domain videos from natural descriptions
Wu, C., Huang, L., Zhang, Q., Li, B., Ji, L., Yang, F., Sapiro, G., and Duan, N. Godiva: Generating open-domain videos from natural descriptions. arXiv preprint arXiv:2104.14806.
-
[41]
Wu, X., Zhang, G., Xu, Z., Zhou, Y., Lu, Q., and He, X. Pack and force your memory: Long-form and consistent video generation. arXiv preprint arXiv:2510.01784.
-
[42]
Xi, H., Yang, S., Zhao, Y., Xu, C., Li, M., Li, X., Lin, Y., Cai, H., Zhang, J., Li, D., et al. Sparse videogen: Accelerating video diffusion transformers with spatial-temporal sparsity. arXiv preprint arXiv:2502.01776.
-
[43]
Xiao, S., Zhang, X., Meng, D., Wang, Q., Zhang, P., and Zhang, B. Knot forcing: Taming autoregressive video diffusion models for real-time infinite interactive portrait animation. arXiv preprint arXiv:2512.21734.
-
[44]
Xu, J., Huang, Y., Cheng, J., Yang, Y., Xu, J., Wang, Y., Duan, W., Yang, S., Jin, Q., Li, S., et al. Visionreward: Fine-grained multi-dimensional human preference learning for image and video generation. arXiv preprint arXiv:2412.21059.
-
[45]
VideoGPT: Video Generation using VQ-VAE and Transformers
Yan, W., Zhang, Y., Abbeel, P., and Srinivas, A. Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:2104.10157.
-
[46]
LongLive: Real-time Interactive Long Video Generation
Yang, S., Huang, W., Chu, R., Xiao, Y., Zhao, Y., Wang, X., Li, M., Xie, E., Chen, Y., Lu, Y., et al. Longlive: Real-time interactive long video generation. arXiv preprint arXiv:2509.22622, 2025a.
Yang, Y., Huang, H., Peng, X., Hu, X., Luo, D., Zhang, J., Wang, C., and Wu, Y. Towards one-step causal video generation via adversarial self-distillation...
-
[47]
Yi, J., Jang, W., Cho, P. H., Nam, J., Yoon, H., and Kim, S. Deep forcing: Training-free long video generation with deep sink and participative compression. arXiv preprint arXiv:2512.05081.
-
[48]
Zhao, M., He, G., Chen, Y., Zhu, H., Li, C., and Zhu, J. Riflex: A free lunch for length extrapolation in video diffusion transformers. arXiv preprint arXiv:2502.15894, 2025a.
Zhao, M., Wang, R., Bao, F., Li, C., and Zhu, J. Controlvideo: conditional control for one-shot text-driven video editing and beyond. Science China Information Sciences, 68(3):1321...
-
[49]
Open-Sora: Democratizing Efficient Video Production for All
Zheng, Z., Peng, X., Yang, T., Shen, C., Li, S., Liu, H., Zhou, Y., Li, T., and You, Y. Open-sora: Democratizing efficient video production for all. arXiv preprint arXiv:2412.20404.
-
[50]
A. Extended Related Work. Video Generative Models. Building on the tremendous success of diffusion models, many works have applied them to video generation (He et al., 2022; Ho et al., 2022; Singer et al., 2022; Blattmann et al., 2023a...
-
[51]
and Wan2.1 (Wan et al., 2025). Apart from the full-sequence diffusion models, some works adopt autoregressive next-token prediction to enable video generation (Wu et al., 2021; Hong et al., 2022; Wu et al., 2022; Weissenborn et al., 2019; Yan et al., 2021; Zhao et al., 2025c;a), such as NOVA (Deng et al.,
-
[52]
and VideoPoet (Kondratyuk et al., 2023). Video generation based on full-sequence diffusion models currently achieves better overall quality than autoregressive next-token prediction. However, full-sequence diffusion models must generate all frames in one shot, which incurs substantial latency and prevents displaying frames to users as they are produced, h...
-
[53]
and Self Forcing (Huang et al., 2025a) introduce distillation strategies to obtain few-step generation models. Such real-time, interactive video generation models are highly promising and have broad applications across many domains. One prominent application is video world modeling. HY-WorldPlay (Sun et al., 2025a), RELIC (Hong et al., 2025), Hunyuan-Game...
-
[54]
train real-time interactive video models for realistic world simulation, allowing users to freely explore and take actions in the simulated environment. This interactive world-modeling paradigm further enables embodied intelligence, such as closed-loop control in Vidarc (Feng et al., 2025). Another major application lies in entertainment and media, suppor...
-
[55]
imply the following: for the above $z_1, z_2$, in a neighborhood of $z_2$ there exist uncountably many $z_k$, each of which maps to a distinct $\phi(x_t, t)^u$, just as $z_2$ does. Equivalently, $P\bigl(\mathrm{Var}(\phi(x_t, t)^u \mid x_t^u, t) > 0\bigr) > 0$. We next prove Proposition 3.3. First, we formalize this in the following statement. Proposition B.2 (Distribution mismatch in chunk-wise regression)....
-
[56]
C. More Discussion of Our Method. C.1. Further Remarks on Autoregressive Diffusion Training Strategies. In this section, we first provide further remarks on diffusion forcing, and then report results for other training strategies, including PFVG (Wu et al., 2025), BAgger (Po et al., 2025), and Resampling Forcing (Guo et al., 2025). As stated in ...
-
[57]
and recent works (e.g., LiveAvatar (Huang et al., 2025b)). Apart from diffusion forcing and teacher forcing, we also experiment with several recent alternatives, including PFVG (Wu et al., 2025), BAgger (Po et al., 2025), and Resampling Forcing (Guo et al., 2025). However, as shown in Tab. 3, these methods provide no significant improvement over teacher f...
-
[58]
However, since we use flow matching, i.e., a v-prediction parameterization for the diffusion ...
Figure 10. Comparison between asymmetric CD and causal CD. Asymmetric CD appears highly blurry...