pith. machine review for the scientific record.

arxiv: 2510.02283 · v1 · submitted 2025-10-02 · 💻 cs.CV · cs.AI

Recognition: no theorem link

Self-Forcing++: Towards Minute-Scale High-Quality Video Generation

Pith reviewed 2026-05-15 22:36 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords video generation · diffusion models · long video · autoregressive generation · self-guidance · temporal consistency · error accumulation

The pith

Self-generated segments from a video model steer it to produce coherent four-minute clips without long-video teachers or retraining on long-video datasets.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Diffusion-based video generators degrade when asked to continue past the short horizons their teacher models can supervise. This paper shows that sampling short segments from the student model's own extended outputs supplies enough guidance to keep latent-space errors from compounding. The resulting Self-Forcing++ procedure scales output length by up to twenty times the teacher's limit while preserving frame-to-frame consistency. No overlapping-frame recomputation or new long-video datasets are required. With additional compute, the approach reaches 4 minutes 15 seconds of video, which is 99.9 percent of the maximum span supported by the base model's position embedding and more than fifty times the length of the baseline model's output.
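
The headline figures can be cross-checked with simple arithmetic. The sketch below uses only the numbers quoted in the abstract (4 minutes 15 seconds, 99.9%, more than 50x); the variable names are illustrative, and because 99.9% is a rounded figure, the implied maximum span is only approximate.

```python
# Figures quoted in the abstract; everything derived from them is plain arithmetic.
reported_seconds = 4 * 60 + 15      # 4 min 15 s
capacity_fraction = 0.999           # "99.9%" of the position-embedding span (rounded)
baseline_ratio = 50                 # "more than 50x longer" than the baseline

# Approximate maximum span the position embedding would support, in seconds.
implied_max_seconds = reported_seconds / capacity_fraction

# Upper bound on the baseline's video length implied by the ">50x" claim.
baseline_upper_bound = reported_seconds / baseline_ratio

print(reported_seconds)               # 255
print(round(implied_max_seconds, 2))  # 255.26
print(baseline_upper_bound)           # 5.1
```

Under these figures the baseline's output would be at most about 5.1 seconds long, and the reported clip sits roughly a quarter of a second below the implied position-embedding limit.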

Core claim

By drawing guidance segments directly from the student model's own long self-generated videos, the method supplies reliable teacher-like signals that prevent error accumulation in the continuous latent space, thereby enabling high-fidelity video generation up to 4 minutes 15 seconds long without any long-horizon teacher or additional training data.

What carries the argument

The Self-Forcing++ loop, which repeatedly samples short segments from the model's own extended generations and feeds them back as conditioning to maintain consistency across the full sequence.
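
The sampling step in this loop can be sketched as follows. This is a minimal illustration, not the paper's implementation: latent chunks are modeled as plain lists of floats, and the names `rollout`, `sample_segment`, and `toy_student` are hypothetical. The toy student only adds small noise to the previous chunk. How the sampled segment is then turned into a guidance signal is not specified on this page and is left out.

```python
import random


def rollout(step_fn, first_chunk, num_chunks):
    """Autoregressively generate `num_chunks` chunks, each from the previous one."""
    chunks = [first_chunk]
    for _ in range(num_chunks - 1):
        chunks.append(step_fn(chunks[-1]))
    return chunks


def sample_segment(long_rollout, segment_len, rng):
    """Pick a contiguous window of `segment_len` chunks uniformly at random."""
    start = rng.randrange(len(long_rollout) - segment_len + 1)
    return start, long_rollout[start:start + segment_len]


rng = random.Random(0)


def toy_student(prev_chunk):
    # Hypothetical stand-in for the student model:
    # copy the previous chunk and perturb it with small Gaussian noise.
    return [x + rng.gauss(0.0, 0.01) for x in prev_chunk]


# Roll out far past a short horizon, then draw one short window from it.
long_video = rollout(toy_student, [0.0] * 8, num_chunks=100)
start, segment = sample_segment(long_video, segment_len=5, rng=rng)
print(len(long_video), len(segment))  # 100 5
```

Sampling the window uniformly over the whole rollout means that late, potentially degraded chunks are as likely to be selected as early ones; that choice is an assumption of this sketch.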

If this is right

  • Temporal consistency is preserved while video length scales up to twenty times beyond the teacher's horizon.
  • Over-exposure and error accumulation are avoided without recomputing overlapping frames.
  • Generation reaches 99.9 percent of the base model's position-embedding span.
  • Fidelity and consistency scores exceed those of prior baselines on both standard and newly proposed benchmarks.
  • No retraining on long-video datasets is needed.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same self-sampling idea could be tested on other autoregressive modalities such as long audio or 3-D scene sequences.
  • If self-guidance works here, it may reduce reliance on ever-larger supervised datasets for extending context in generative models.
  • Position-embedding limits rather than error accumulation may become the next practical bottleneck once this technique is applied.

Load-bearing premise

Segments taken from the model's own long outputs remain high-quality enough to act as stable guidance without injecting new compounding errors.

What would settle it

Run the method on a target of five minutes and measure whether visual fidelity or temporal consistency drops measurably below the quality achieved at four minutes fifteen seconds.

Original abstract

Diffusion models have revolutionized image and video generation, achieving unprecedented visual quality. However, their reliance on transformer architectures incurs prohibitively high computational costs, particularly when extending generation to long videos. Recent work has explored autoregressive formulations for long video generation, typically by distilling from short-horizon bidirectional teachers. Nevertheless, given that teacher models cannot synthesize long videos, the extrapolation of student models beyond their training horizon often leads to pronounced quality degradation, arising from the compounding of errors within the continuous latent space. In this paper, we propose a simple yet effective approach to mitigate quality degradation in long-horizon video generation without requiring supervision from long-video teachers or retraining on long video datasets. Our approach centers on exploiting the rich knowledge of teacher models to provide guidance for the student model through sampled segments drawn from self-generated long videos. Our method maintains temporal consistency while scaling video length by up to 20x beyond teacher's capability, avoiding common issues such as over-exposure and error-accumulation without recomputing overlapping frames like previous methods. When scaling up the computation, our method shows the capability of generating videos up to 4 minutes and 15 seconds, equivalent to 99.9% of the maximum span supported by our base model's position embedding and more than 50x longer than that of our baseline model. Experiments on standard benchmarks and our proposed improved benchmark demonstrate that our approach substantially outperforms baseline methods in both fidelity and consistency. Our long-horizon videos demo can be found at https://self-forcing-plus-plus.github.io/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes Self-Forcing++, a method for long-horizon autoregressive video generation that uses sampled segments from the student model's own self-generated videos as guidance signals. This is intended to mitigate compounding errors in the continuous latent space without access to long-video teacher models or retraining on extended datasets, enabling videos up to 4 minutes 15 seconds (99.9% of the base model's position-embedding span and >50x longer than the baseline) while preserving temporal consistency and visual fidelity.

Significance. If the central premise holds, the work would meaningfully advance scalable video synthesis by demonstrating that self-guidance can substitute for unavailable teacher supervision at extreme lengths, potentially reducing data and compute barriers for minute-scale generation. The reported length extension and benchmark gains would represent a practical step beyond current short-horizon distillation limits.

major comments (3)
  1. [Method and Experiments] The central claim that self-generated segments supply reliable, non-degrading guidance rests on an unverified premise: no quantitative characterization of latent-space error growth (e.g., drift in continuous codes) within the initial long roll-outs is provided, nor is there an ablation isolating segment quality or a comparison against oracle teacher segments. This directly affects the soundness of the 20x scaling and 4:15 capability assertions.
  2. [Abstract and Experiments] Abstract and results sections: the reported improvements in fidelity and consistency are stated without accompanying quantitative metrics (specific FVD, FID, or temporal consistency scores with error bars and baseline deltas), segment-selection criteria, or ablation studies on how self-guidance avoids the error accumulation it claims to solve.
  3. [Results and Scaling Experiments] The 99.9% position-embedding utilization claim is load-bearing for the headline result, yet the manuscript does not detail the mechanism by which self-guidance prevents degradation at this horizon or provide controls (e.g., generation without self-segments) to isolate its contribution.
minor comments (2)
  1. [Abstract and Experiments] The abstract references 'our proposed improved benchmark' without a clear description of its construction, differences from standard benchmarks, or evaluation protocol in the main text.
  2. [Method] Notation for segment sampling and guidance injection could be clarified with a concise algorithm box or pseudocode to improve reproducibility.
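
Major comment 1 asks for a quantitative characterization of latent-space drift. One simple way to obtain such a curve, offered as an illustrative assumption rather than the paper's or the referee's protocol, is to track how far each chunk's summary statistics move from those of the first chunk. The function names and the toy rollout below are hypothetical; the toy data adds a fixed bias per chunk purely to exercise the code.

```python
import math
import random


def chunk_stats(chunk):
    """Return the (mean, standard deviation) of one latent chunk."""
    mean = sum(chunk) / len(chunk)
    var = sum((x - mean) ** 2 for x in chunk) / len(chunk)
    return mean, math.sqrt(var)


def drift_curve(chunks):
    """Distance of each chunk's (mean, std) from the first chunk's (mean, std)."""
    ref_mean, ref_std = chunk_stats(chunks[0])
    curve = []
    for chunk in chunks:
        mean, std = chunk_stats(chunk)
        curve.append(math.hypot(mean - ref_mean, std - ref_std))
    return curve


# Toy rollout (not real model output): every chunk shifts by a constant 0.05,
# so the drift should grow linearly with the chunk index.
rng = random.Random(0)
first = [rng.gauss(0.0, 1.0) for _ in range(64)]
chunks = [[x + 0.05 * i for x in first] for i in range(20)]

curve = drift_curve(chunks)
print(round(curve[0], 3), round(curve[-1], 3))  # 0.0 0.95
```

A flat curve across a long roll-out would support the premise that self-generated segments stay usable as guidance, while a steadily rising one would point to the compounding error the referee is concerned about.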

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for their thorough review and constructive comments. We address each of the major comments below and have updated the manuscript to incorporate additional analyses and clarifications as suggested.

Point-by-point responses
  1. Referee: [Method and Experiments] The central claim that self-generated segments supply reliable, non-degrading guidance rests on an unverified premise: no quantitative characterization of latent-space error growth (e.g., drift in continuous codes) within the initial long roll-outs is provided, nor is there an ablation isolating segment quality or a comparison against oracle teacher segments. This directly affects the soundness of the 20x scaling and 4:15 capability assertions.

    Authors: We agree that a quantitative characterization of latent-space error growth would strengthen the central claim. In the revised manuscript we have added measurements of drift in continuous latent codes across initial long roll-outs. We also include a new ablation that isolates segment quality and compares self-generated segments against oracle teacher segments (where short-horizon teacher outputs can be obtained). These additions directly support the soundness of the reported 20x scaling and 4:15 capability. revision: yes

  2. Referee: [Abstract and Experiments] Abstract and results sections: the reported improvements in fidelity and consistency are stated without accompanying quantitative metrics (specific FVD, FID, or temporal consistency scores with error bars and baseline deltas), segment-selection criteria, or ablation studies on how self-guidance avoids the error accumulation it claims to solve.

    Authors: We acknowledge that explicit quantitative metrics, segment-selection criteria, and targeted ablations were insufficiently detailed. The revised manuscript now reports specific FVD, FID, and temporal consistency scores with error bars and baseline deltas. We have also added the segment-selection criteria and ablation studies that isolate how self-guidance mitigates error accumulation. These updates appear in both the abstract and the results sections. revision: yes

  3. Referee: [Results and Scaling Experiments] The 99.9% position-embedding utilization claim is load-bearing for the headline result, yet the manuscript does not detail the mechanism by which self-guidance prevents degradation at this horizon or provide controls (e.g., generation without self-segments) to isolate its contribution.

    Authors: We agree that the mechanism and isolating controls should be made explicit. The revised version explains that self-guidance periodically resets latent-space drift by injecting high-fidelity self-generated segments. We have added control experiments that generate long videos without self-segments; these exhibit rapid degradation, thereby isolating the contribution of self-guidance at the 99.9% position-embedding horizon. revision: yes

Circularity Check

0 steps flagged

No significant circularity; method is an empirical technique validated experimentally

full rationale

The paper proposes using sampled segments from self-generated long videos as guidance for the student model to extend generation length without long-video teacher data. This is framed as a practical mitigation for error compounding in autoregressive extrapolation, with performance claims (scaling to 99.9% of position-embedding span and 50x baseline) resting on experimental benchmarks rather than any derivation. No equations, parameter fits, or self-citations are shown that reduce the central result to the inputs by construction. The premise that self-samples remain sufficiently high-quality is an empirical assumption subject to external validation, not a self-definitional or fitted-input equivalence. The derivation chain is therefore self-contained against the stated experimental evidence.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on the domain assumption that short-horizon teacher knowledge can be transferred via self-samples without external long-video data; no explicit free parameters or invented entities are named in the abstract.

axioms (1)
  • domain assumption: Self-generated segments from the student model remain sufficiently high-quality to provide effective guidance equivalent to a bidirectional teacher.
    Invoked in the description of the core approach to mitigate quality degradation.

pith-pipeline@v0.9.0 · 5593 in / 1246 out tokens · 51830 ms · 2026-05-15T22:36:28.846851+00:00 · methodology

Forward citations

Cited by 23 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. CausalCine: Real-Time Autoregressive Generation for Multi-Shot Video Narratives

    cs.CV 2026-05 unverdicted novelty 7.0

    CausalCine enables real-time causal autoregressive multi-shot video generation via multi-shot training, content-aware memory routing for coherence, and distillation to few-step inference.

  2. HorizonDrive: Self-Corrective Autoregressive World Model for Long-horizon Driving Simulation

    cs.CV 2026-05 conditional novelty 7.0

    HorizonDrive enables stable long-horizon autoregressive driving simulation via anti-drifting teacher training with scheduled rollout recovery and teacher rollout distillation.

  3. FreeSpec: Training-Free Long Video Generation via Singular-Spectrum Reconstruction

    cs.CV 2026-05 unverdicted novelty 7.0

    FreeSpec uses SVD-based spectral reconstruction to fuse global low-rank and local high-rank features, reducing content drift and preserving temporal dynamics in long video generation.

  4. Stream-R1: Reliability-Perplexity Aware Reward Distillation for Streaming Video Generation

    cs.CV 2026-05 unverdicted novelty 7.0

    Stream-R1 improves distillation of autoregressive streaming video diffusion models by adaptively weighting supervision with a reward model at both rollout and per-pixel levels.

  5. AsymTalker: Identity-Consistent Long-Term Talking Head Generation via Asymmetric Distillation

    cs.LG 2026-05 unverdicted novelty 7.0

    AsymTalker maintains identity consistency in long-term diffusion talking-head videos by encoding temporal references from a static image and training a student model under inference-like conditions via asymmetric dist...

  6. Efficient Video Diffusion Models: Advancements and Challenges

    cs.CV 2026-04 unverdicted novelty 7.0

    A survey that groups efficient video diffusion methods into four paradigms—step distillation, efficient attention, model compression, and cache/trajectory optimization—and outlines open challenges for practical use.

  7. Grounded Forcing: Bridging Time-Independent Semantics and Proximal Dynamics in Autoregressive Video Synthesis

    cs.CV 2026-04 unverdicted novelty 7.0

    Grounded Forcing introduces dual memory caching, reference-based positional embeddings, and proximity-weighted recaching to bridge stable semantics with local dynamics, improving long-range consistency in autoregressi...

  8. DreamDojo: A Generalist Robot World Model from Large-Scale Human Videos

    cs.RO 2026-02 unverdicted novelty 7.0

    DreamDojo is a foundation world model pretrained on the largest human video dataset to date that uses continuous latent actions to transfer interaction knowledge and achieves controllable physics simulation after robo...

  9. Head Forcing: Long Autoregressive Video Generation via Head Heterogeneity

    cs.CV 2026-05 unverdicted novelty 6.0

    Head Forcing assigns tailored KV cache strategies to local, anchor, and memory attention heads plus head-wise RoPE re-encoding to extend autoregressive video generation from seconds to minutes without training.

  10. Delta Forcing: Trust Region Steering for Interactive Autoregressive Video Generation

    cs.CV 2026-05 unverdicted novelty 6.0

    Delta Forcing uses latent trajectory deltas to adaptively limit unreliable teacher guidance while enforcing monotonic continuity, improving temporal consistency in interactive autoregressive video generation.

  11. AsymTalker: Identity-Consistent Long-Term Talking Head Generation via Asymmetric Distillation

    cs.LG 2026-05 unverdicted novelty 6.0

    AsymK-Talker introduces kernel-conditioned loop generation, temporal reference encoding, and asymmetric kernel distillation to achieve real-time, drift-resistant talking head synthesis from audio using diffusion models.

  12. AsymTalker: Identity-Consistent Long-Term Talking Head Generation via Asymmetric Distillation

    cs.LG 2026-05 unverdicted novelty 6.0

    AsymTalker uses temporal reference encoding and asymmetric knowledge distillation to produce identity-consistent talking head videos up to 600 seconds long at 66 FPS.

  13. Mutual Forcing: Dual-Mode Self-Evolution for Fast Autoregressive Audio-Video Character Generation

    cs.CV 2026-04 unverdicted novelty 6.0

    Mutual Forcing trains a single native autoregressive audio-video model with mutually reinforcing few-step and multi-step modes via self-distillation to match 50-step baselines at 4-8 steps.

  14. Memorize When Needed: Decoupled Memory Control for Spatially Consistent Long-Horizon Video Generation

    cs.CV 2026-04 unverdicted novelty 6.0

    A decoupled memory branch with hybrid cues, cross-attention, and gating improves spatial consistency and data efficiency in long-horizon camera-trajectory video generation.

  15. Repurposing 3D Generative Model for Autoregressive Layout Generation

    cs.CV 2026-04 unverdicted novelty 6.0

    LaviGen turns 3D generative models into an autoregressive layout generator that models geometric and physical constraints, delivering 19% higher physical plausibility and 65% faster inference on the LayoutVLM benchmark.

  16. From Synchrony to Sequence: Exo-to-Ego Generation via Interpolation

    cs.CV 2026-04 unverdicted novelty 6.0

    Interpolating exo and ego videos into a single continuous sequence lets diffusion sequence models generate more coherent first-person videos than direct conditioning, even without pose interpolation.

  17. Long-Horizon Streaming Video Generation via Hybrid Attention with Decoupled Distillation

    cs.CV 2026-04 conditional novelty 6.0

    Hybrid Forcing combines linear temporal attention for long-range retention, block-sparse attention for efficiency, and decoupled distillation to achieve real-time unbounded 832x480 streaming video generation at 29.5 FPS.

  18. Rolling Sink: Bridging Limited-Horizon Training and Open-Ended Testing in Autoregressive Video Diffusion

    cs.CV 2026-02 unverdicted novelty 6.0

    Rolling Sink is a training-free cache adjustment technique that maintains visual consistency in autoregressive video diffusion models for ultra-long open-ended generation beyond training horizons.

  19. WorldPlay: Towards Long-Term Geometric Consistency for Real-Time Interactive World Modeling

    cs.CV 2025-12 unverdicted novelty 6.0

    WorldPlay uses dual action representation, reconstituted context memory, and context forcing distillation to produce consistent 720p streaming video at 24 FPS for interactive world modeling.

  20. TurboTalk: Progressive Distillation for One-Step Audio-Driven Talking Avatar Generation

    cs.CV 2026-04 unverdicted novelty 5.0

    TurboTalk uses progressive distillation from 4 steps to 1 step with distribution matching and adversarial training to achieve 120x faster single-step audio-driven talking avatar video generation.

  21. Matrix-Game 3.0: Real-Time and Streaming Interactive World Model with Long-Horizon Memory

    cs.CV 2026-04 unverdicted novelty 4.0

    Matrix-Game 3.0 delivers 720p real-time video generation at 40 FPS with minute-scale memory consistency by combining residual self-correction training, camera-aware memory injection, and DMD-based autoregressive disti...

  22. EchoTorrent: Towards Swift, Sustained, and Streaming Multi-Modal Video Generation

    cs.CV 2026-02 unverdicted novelty 4.0

    EchoTorrent combines multi-teacher distillation, adaptive CFG calibration, hybrid long-tail forcing, and VAE decoder refinement to enable few-pass autoregressive streaming video generation with improved temporal consi...

  23. Evolution of Video Generative Foundations

    cs.CV 2026-04 unverdicted novelty 2.0

    This survey traces video generation technology from GANs to diffusion models and then to autoregressive and multimodal approaches while analyzing principles, strengths, and future trends.

Reference graph

Works this paper leans on

72 extracted references · 72 canonical work pages · cited by 21 Pith papers · 26 internal anchors

  1. [1]

    Diffusion for world modeling: Visual details matter in atari.Advancesin Neural Information Processing Systems, 37:58757–58791, 2024

    Eloi Alonso, Adam Jelley, Vincent Micheli, Anssi Kanervisto, Amos J Storkey, Tim Pearce, and François Fleuret. Diffusion for world modeling: Visual details matter in atari.Advancesin Neural Information Processing Systems, 37:58757–58791, 2024

  2. [2]

    Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

    Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets.arXiv preprint arXiv:2311.15127, 2023

  3. [3]

    Genie: Generative interactive environments

    Jake Bruce, Michael D Dennis, Ashley Edwards, Jack Parker-Holder, Yuge Shi, Edward Hughes, Matthew Lai, Aditi Mavalankar, Richie Steigerwald, Chris Apps, et al. Genie: Generative interactive environments. InForty-first International Conference on Machine Learning, 2024

  4. [4]

    Videojam: Joint appearance-motion representations for en- hanced motion generation in video models.arXiv preprint arXiv:2502.02492, 2025

    Hila Chefer, Uriel Singer, Amit Zohar, Yuval Kirstain, Adam Polyak, Yaniv Taigman, Lior Wolf, and Shelly Sheynin. Videojam: Joint appearance-motion representations for enhanced motion generation in video models. arXiv preprint arXiv:2502.02492, 2025

  5. [5]

    Diffusion forcing: Next-token prediction meets full-sequence diffusion.Advancesin Neural Information Processing Systems, 37:24081–24125, 2024

    Boyuan Chen, Diego Martí Monsó, Yilun Du, Max Simchowitz, Russ Tedrake, and Vincent Sitzmann. Diffusion forcing: Next-token prediction meets full-sequence diffusion.Advancesin Neural Information Processing Systems, 37:24081–24125, 2024

  6. [6]

    SkyReels-V2: Infinite-length Film Generative Model

    Guibin Chen, Dixuan Lin, Jiangping Yang, Chunze Lin, Junchen Zhu, Mingyuan Fan, Hao Zhang, Sheng Chen, Zheng Chen, Chengcheng Ma, et al. Skyreels-v2: Infinite-length film generative model.arXiv preprint arXiv:2504.13074, 2025

  7. [7]

    Extending Context Window of Large Language Models via Positional Interpolation

    Shouyuan Chen, Sherman Wong, Liangjian Chen, and Yuandong Tian. Extending context window of large language models via positional interpolation.arXiv preprint arXiv:2306.15595, 2023

  8. [8]

    Chatbot arena.LMArena, https://lmarena

    Wei-Lin Chiang, Anastasios Angelopoulos, L Zheng, Y Sheng, L Dunlap, C Chou, T Li, E Frick, N Jain, D Li, et al. Chatbot arena.LMArena, https://lmarena. ai, 2024

  9. [9]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities.arXiv preprint arXiv:2507.06261, 2025

  10. [10]

    arXiv preprint arXiv:2412.14169 (2024) 30

    Haoge Deng, Ting Pan, Haiwen Diao, Zhengxiong Luo, Yufeng Cui, Huchuan Lu, Shiguang Shan, Yonggang Qi, and Xinlong Wang. Autoregressive video generation without vector quantization.arXiv preprintarXiv:2412.14169, 2024

  11. [11]

    Autoregressive video generation without vector quantization

    Haoge Deng, Ting Pan, Haiwen Diao, Zhengxiong Luo, Yufeng Cui, Huchuan Lu, Shiguang Shan, Yonggang Qi, and Xinlong Wang. Autoregressive video generation without vector quantization. InICLR, 2025

  12. [12]

    Inflvg: Reinforce inference-time consistent long video generation with grpo.arXiv preprint arXiv:2505.17574, 2025

    Xueji Fang, Liyuan Ma, Zhiyang Chen, Mingyuan Zhou, and Guo-jun Qi. Inflvg: Reinforce inference-time consistent long video generation with grpo.arXiv preprint arXiv:2505.17574, 2025

  13. [13]

    Veo.https://deepmind.google/models/veo/, 2025

    Google DeepMind. Veo.https://deepmind.google/models/veo/, 2025. Accessed: 2025-09-09

  14. [14]

    LTX-Video: Realtime Video Latent Diffusion

    Yoav HaCohen, Nisan Chiprut, Benny Brazowski, Daniel Shalem, Dudu Moshe, Eitan Richardson, Eran Levin, Guy Shiran, Nir Zabari, Ori Gordon, et al. Ltx-video: Realtime video latent diffusion.arXiv preprint arXiv:2501.00103, 2024

  15. [15]

    Matrix-game 2.0: An open-source real-time and streaming interactive world model

    Xianglong He, Chunli Peng, Zexiang Liu, Boyang Wang, Yifan Zhang, Qi Cui, Fei Kang, Biao Jiang, Mengyin An, Yangyang Ren, et al. Matrix-game 2.0: An open-source, real-time, and streaming interactive world model.arXiv preprint arXiv:2508.13009, 2025

  16. [16]

    Latent Video Diffusion Models for High-Fidelity Long Video Generation

    Yingqing He, Tianyu Yang, Yong Zhang, Ying Shan, and Qifeng Chen. Latent video diffusion models for high-fidelity long video generation.arXiv preprint arXiv:2211.13221, 2022

  17. [17]

    Imagen Video: High Definition Video Generation with Diffusion Models

    Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P Kingma, Ben Poole, Mohammad Norouzi, David J Fleet, et al. Imagen video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303, 2022

  18. [18]

    Video diffusion models

    Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J Fleet. Video diffusion models. Advancesin neural information processing systems, 35:8633–8646, 2022. 12

  19. [19]

    CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers

    Wenyi Hong, Ming Ding, Wendi Zheng, Xinghan Liu, and Jie Tang. Cogvideo: Large-scale pretraining for text-to-video generation via transformers.arXiv preprint arXiv:2205.15868, 2022

  20. [20]

    Advancing language model reasoning through reinforcement learning and inference scaling

    Zhenyu Hou, Xin Lv, Rui Lu, Jiajie Zhang, Yujiang Li, Zijun Yao, Juanzi Li, Jie Tang, and Yuxiao Dong. Advancing language model reasoning through reinforcement learning and inference scaling. arXiv preprint arXiv:2501.11651, 2025

  21. [21]

    Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion

    Xun Huang, Zhengqi Li, Guande He, Mingyuan Zhou, and Eli Shechtman. Self forcing: Bridging the train-test gap in autoregressive video diffusion.arXiv preprint arXiv:2506.08009, 2025

  22. [22]

    Vbench: Comprehensive benchmark suite for video generative models

    Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, et al. Vbench: Comprehensive benchmark suite for video generative models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21807–21818, 2024

  23. [23]

    Lovic: Efficient long video generation with context compression.arXiv preprint arXiv:2507.12952, 2025

    Jiaxiu Jiang, Wenbo Li, Jingjing Ren, Yuping Qiu, Yong Guo, Xiaogang Xu, Han Wu, and Wangmeng Zuo. Lovic: Efficient long video generation with context compression.arXiv preprint arXiv:2507.12952, 2025

  24. [24]

    Pyramidal flow matching for efficient video generative modeling

    Yang Jin, Zhicheng Sun, Ningyuan Li, Kun Xu, Hao Jiang, Nan Zhuang, Quzhe Huang, Yang Song, Yadong Mu, and Zhouchen Lin. Pyramidal flow matching for efficient video generative modeling. InICLR, 2025

  25. [25]

    Fifo-diffusion: Generating infinite videos from text without training

    Jihwan Kim, Junoh Kang, Jinyoung Choi, and Bohyung Han. Fifo-diffusion: Generating infinite videos from text without training. Advancesin Neural Information Processing Systems, 37:89834–89868, 2024

  26. [26]

    Auto-Encoding Variational Bayes

    Diederik P Kingma and Max Welling. Auto-encoding variational bayes.arXiv preprint arXiv:1312.6114, 2013

  27. [27]

    Pick-a-pic: An open dataset of user preferences for text-to-image generation.Advancesin neural information processing systems, 36:36652–36663, 2023

    Yuval Kirstain, Adam Polyak, Uriel Singer, Shahbuland Matiana, Joe Penna, and Omer Levy. Pick-a-pic: An open dataset of user preferences for text-to-image generation.Advancesin neural information processing systems, 36:36652–36663, 2023

  28. [28]

    Streamdit: Real-time streaming text-to-video generation.arXiv preprint arXiv:2507.03745,

    Akio Kodaira, Tingbo Hou, Ji Hou, Masayoshi Tomizuka, and Yue Zhao. Streamdit: Real-time streaming text-to-video generation. arXiv preprint arXiv:2507.03745, 2025

  29. [29]

    HunyuanVideo: A Systematic Framework For Large Video Generative Models

    Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. Hunyuanvideo: A systematic framework for large video generative models.arXiv preprint arXiv:2412.03603, 2024

  30. [30]

    Training language models to self-correct via reinforcement learning

    Aviral Kumar, Vincent Zhuang, Rishabh Agarwal, Yi Su, John D Co-Reyes, Avi Singh, Kate Baumli, Shariq Iqbal, Colton Bishop, Rebecca Roelofs, et al. Training language models to self-correct via reinforcement learning.arXiv preprint arXiv:2409.12917, 2024

  31. [32]

    Hunyuan-gamecraft: High-dynamic interactive game video generation with hybrid history condition,

    Jiaqi Li, Junshu Tang, Zhiyong Xu, Longhuang Wu, Yuan Zhou, Shuai Shao, Tianbao Yu, Zhiguo Cao, and Qinglin Lu. Hunyuan-gamecraft: High-dynamic interactive game video generation with hybrid history condition,

  32. [33]

    URLhttps://arxiv.org/abs/2506.17201

  33. [34]

    Open-Sora Plan: Open-Source Large Video Generation Model

    Bin Lin, Yunyang Ge, Xinhua Cheng, Zongjian Li, Bin Zhu, Shaodong Wang, Xianyi He, Yang Ye, Shenghai Yuan, Liuhan Chen, et al. Open-sora plan: Open-source large video generation model.arXiv preprint arXiv:2412.00131, 2024

  34. [35]

    Autoregressive adversarial post-training for real-time interactive video generation

    Shanchuan Lin, Ceyuan Yang, Hao He, Jianwen Jiang, Yuxi Ren, Xin Xia, Yang Zhao, Xuefeng Xiao, and Lu Jiang. Autoregressive adversarial post-training for real-time interactive video generation.arXiv preprint arXiv:2506.09350, 2025

  35. [36]

    Flow-GRPO: Training Flow Matching Models via Online RL

    Jie Liu, Gongye Liu, Jiajun Liang, Yangguang Li, Jiaheng Liu, Xintao Wang, Pengfei Wan, Di Zhang, and Wanli Ouyang. Flow-grpo: Training flow matching models via online rl.arXiv preprint arXiv:2505.05470, 2025

  36. [37]

    Improving Video Generation with Human Feedback

    Jie Liu, Gongye Liu, Jiajun Liang, Ziyang Yuan, Xiaokun Liu, Mingwu Zheng, Xiele Wu, Qiulin Wang, Wenyu Qin, Menghan Xia, et al. Improving video generation with human feedback.arXiv preprint arXiv:2501.13918, 2025

  37. [38]

    Rolling Forcing: Autoregressive Long Video Diffusion in Real Time

    Kunhao Liu, Wenbo Hu, Jiale Xu, Ying Shan, and Shijian Lu. Rolling forcing: Autoregressive long video diffusion in real time, 2025. URL https://arxiv.org/abs/2509.25161

  38. [39]

    Videoreasonbench: Can mllms perform vision-centric complex video reasoning?

    Yuanxin Liu, Kun Ouyang, Haoning Wu, Yi Liu, Lin Sui, Xinhao Li, Yan Zhong, Y Charles, Xinyu Zhou, and Xu Sun. Videoreasonbench: Can mllms perform vision-centric complex video reasoning? arXiv preprint arXiv:2505.23359, 2025

  39. [40]

    Worldweaver: Generating long-horizon video worlds via rich perception

    Zhiheng Liu, Xueqing Deng, Shoufa Chen, Angtian Wang, Qiushan Guo, Mingfei Han, Zeyue Xue, Mengzhao Chen, Ping Luo, and Linjie Yang. Worldweaver: Generating long-horizon video worlds via rich perception. arXiv preprint arXiv:2508.15720, 2025

  40. [41]

    Enhance-a-video: Better generated video for free

    Yang Luo, Xuanlei Zhao, Mengzhao Chen, Kaipeng Zhang, Wenqi Shao, Kai Wang, Zhangyang Wang, and Yang You. Enhance-a-video: Better generated video for free. arXiv preprint arXiv:2502.07508, 2025

  41. [42]

    Learning few-step diffusion models by trajectory distribution matching

    Yihong Luo, Tianyang Hu, Jiacheng Sun, Yujun Cai, and Jing Tang. Learning few-step diffusion models by trajectory distribution matching. arXiv preprint arXiv:2503.06674, 2025

  42. [43]

    Optical-flow guided prompt optimization for coherent video generation

    Hyelin Nam, Jaemin Kim, Dohun Lee, and Jong Chul Ye. Optical-flow guided prompt optimization for coherent video generation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 7837–7846, 2025

  43. [44]

    Video generation models as world simulators

    OpenAI. Video generation models as world simulators. Technical report, OpenAI, February 2024. URL https://openai.com/index/video-generation-models-as-world-simulators/

  44. [45]

    Training language models to follow instructions with human feedback

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35:27730–27744, 2022

  45. [46]

    Scalable diffusion models with transformers

    William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pages 4195–4205, 2023

  46. [47]

    YaRN: Efficient Context Window Extension of Large Language Models

    Bowen Peng, Jeffrey Quesnelle, Honglu Fan, and Enrico Shippole. Yarn: Efficient context window extension of large language models. arXiv preprint arXiv:2309.00071, 2023

  47. [48]

    Open-Sora 2.0: Training a Commercial-Level Video Generation Model in $200k

    Xiangyu Peng, Zangwei Zheng, Chenhui Shen, Tom Young, Xinying Guo, Binluo Wang, Hang Xu, Hongxin Liu, Mingyan Jiang, Wenjun Li, et al. Open-sora 2.0: Training a commercial-level video generation model in $200k. arXiv preprint arXiv:2503.09642, 2025

  48. [49]

    Movie Gen: A Cast of Media Foundation Models

    Adam Polyak, Amit Zohar, Andrew Brown, Andros Tjandra, Animesh Sinha, Ann Lee, Apoorv Vyas, Bowen Shi, Chih-Yao Ma, Ching-Yao Chuang, et al. Movie gen: A cast of media foundation models. arXiv preprint arXiv:2410.13720, 2024

  49. [50]

    Direct preference optimization: Your language model is secretly a reward model

    Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in neural information processing systems, 36:53728–53741, 2023

  50. [51]

    U-net: Convolutional networks for biomedical image segmentation

    Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pages 234–241. Springer, 2015

  51. [52]

    Rolling diffusion models

    David Ruhe, Jonathan Heek, Tim Salimans, and Emiel Hoogeboom. Rolling diffusion models, 2024. URL https://arxiv.org/abs/2402.09470

  52. [53]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024

  53. [54]

    Consistency models

    Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever. Consistency models. 2023

  54. [55]

    MAGI-1: Autoregressive Video Generation at Scale

    Hansi Teng, Hongyu Jia, Lei Sun, Lingzhi Li, Maolin Li, Mingqiu Tang, Shuai Han, Tianning Zhang, WQ Zhang, Weifeng Luo, et al. Magi-1: Autoregressive video generation at scale. arXiv preprint arXiv:2505.13211, 2025

  55. [56]

    Diffusion model alignment using direct preference optimization

    Bram Wallace, Meihua Dang, Rafael Rafailov, Linqi Zhou, Aaron Lou, Senthil Purushwalkam, Stefano Ermon, Caiming Xiong, Shafiq Joty, and Nikhil Naik. Diffusion model alignment using direct preference optimization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8228–8238, 2024

  56. [57]

    Wan: Open and Advanced Large-Scale Video Generative Models

    Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, Jianyuan Zeng, Jiayu Wang, Jingfeng Zhang, Jingren Zhou, Jinkai Wang, Jixuan Chen, Kai Zhu, Kang Zhao, Keyu Yan, Lianghua Huang, Mengyang Feng, Ningyi Zhang, Pandeng Li, Pingyu Wu, Ruihang Chu, Ruili Feng, Shiwei Zhang, Siyang Sun, Tao Fang...

  57. [58]

    Phased consistency models

    Fu-Yun Wang, Zhaoyang Huang, Alexander Bergman, Dazhong Shen, Peng Gao, Michael Lingelbach, Keqiang Sun, Weikang Bian, Guanglu Song, Yu Liu, et al. Phased consistency models. Advances in neural information processing systems, 37:83951–84009, 2024

  58. [59]

    Vidprom: A million-scale real prompt-gallery dataset for text-to-video diffusion models

    Wenhao Wang and Yi Yang. Vidprom: A million-scale real prompt-gallery dataset for text-to-video diffusion models. Advances in Neural Information Processing Systems, 37:65618–65642, 2024

  59. [60]

    Human Preference Score v2: A Solid Benchmark for Evaluating Human Preferences of Text-to-Image Synthesis

    Xiaoshi Wu, Yiming Hao, Keqiang Sun, Yixiong Chen, Feng Zhu, Rui Zhao, and Hongsheng Li. Human preference score v2: A solid benchmark for evaluating human preferences of text-to-image synthesis. arXiv preprint arXiv:2306.09341, 2023

  60. [61]

    Efficient Streaming Language Models with Attention Sinks

    Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient streaming language models with attention sinks. arXiv preprint arXiv:2309.17453, 2023

  61. [62]

    Teaching language models to critique via reinforcement learning

    Zhihui Xie, Liyu Chen, Weichao Mao, Jingjing Xu, Lingpeng Kong, et al. Teaching language models to critique via reinforcement learning. arXiv preprint arXiv:2502.03492, 2025

  62. [63]

    Imagereward: learning and evaluating human preferences for text-to-image generation

    Jiazheng Xu, Xiao Liu, Yuchen Wu, Yuxuan Tong, Qinkai Li, Ming Ding, Jie Tang, and Yuxiao Dong. Imagereward: learning and evaluating human preferences for text-to-image generation. In Proceedings of the 37th International Conference on Neural Information Processing Systems, pages 15903–15935, 2023

  63. [64]

    VisionReward: Fine-Grained Multi-Dimensional Human Preference Learning for Image and Video Generation

    Jiazheng Xu, Yu Huang, Jiale Cheng, Yuanming Yang, Jiajun Xu, Yuan Wang, Wenbo Duan, Shen Yang, Qunlin Jin, Shurun Li, Jiayan Teng, Zhuoyi Yang, Wendi Zheng, Xiao Liu, Ming Ding, Xiaohan Zhang, Xiaotao Gu, Shiyu Huang, Minlie Huang, Jie Tang, and Yuxiao Dong. Visionreward: Fine-grained multi-dimensional human preference learning for image and video genera...

  64. [65]

    DanceGRPO: Unleashing GRPO on Visual Generation

    Zeyue Xue, Jie Wu, Yu Gao, Fangyuan Kong, Lingting Zhu, Mengzhao Chen, Zhiheng Liu, Wei Liu, Qiushan Guo, Weilin Huang, et al. Dancegrpo: Unleashing grpo on visual generation. arXiv preprint arXiv:2505.07818, 2025

  65. [66]

    Longlive: Real-time interactive long video generation

    Shuai Yang, Wei Huang, Ruihang Chu, Yicheng Xiao, Yuyang Zhao, Xianbang Wang, Muyang Li, Enze Xie, Yingcong Chen, Yao Lu, Song Han, and Yukang Chen. Longlive: Real-time interactive long video generation. 2025

  66. [67]

    CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

    Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. Cogvideox: Text-to-video diffusion models with an expert transformer. arXiv preprint arXiv:2408.06072, 2024

  67. [68]

    Improved distribution matching distillation for fast image synthesis

    Tianwei Yin, Michaël Gharbi, Taesung Park, Richard Zhang, Eli Shechtman, Fredo Durand, and Bill Freeman. Improved distribution matching distillation for fast image synthesis. Advances in neural information processing systems, 37:47455–47487, 2024

  68. [69]

    One-step diffusion with distribution matching distillation

    Tianwei Yin, Michaël Gharbi, Richard Zhang, Eli Shechtman, Fredo Durand, William T Freeman, and Taesung Park. One-step diffusion with distribution matching distillation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6613–6623, 2024

  69. [70]

    From slow bidirectional to fast autoregressive video diffusion models

    Tianwei Yin, Qiang Zhang, Richard Zhang, William T Freeman, Fredo Durand, Eli Shechtman, and Xun Huang. From slow bidirectional to fast autoregressive video diffusion models. In CVPR, 2025

  70. [71]

    Packing input frame context in next-frame prediction models for video generation

    Lvmin Zhang and Maneesh Agrawala. Packing input frame context in next-frame prediction models for video generation. arXiv preprint arXiv:2504.12626, 2025

  71. [72]

    Riflex: A free lunch for length extrapolation in video diffusion transformers

    Min Zhao, Guande He, Yixiao Chen, Hongzhou Zhu, Chongxuan Li, and Jun Zhu. Riflex: A free lunch for length extrapolation in video diffusion transformers. arXiv preprint arXiv:2502.15894, 2025

  72. [73]

    Lumina-next: Making lumina-t2x stronger and faster with next-dit

    Le Zhuo, Ruoyi Du, Han Xiao, Yangguang Li, Dongyang Liu, Rongjie Huang, Wenze Liu, Xiangyang Zhu, Fu-Yun Wang, Zhanyu Ma, et al. Lumina-next: Making lumina-t2x stronger and faster with next-dit. Advances in Neural Information Processing Systems, 37:131278–131315, 2024. 8 Appendix 8.1 Detailed evaluation results for all dimensions Due to limited space, we...