Recognition: no theorem link
Self-Forcing++: Towards Minute-Scale High-Quality Video Generation
Pith reviewed 2026-05-15 22:36 UTC · model grok-4.3
The pith
Self-generated segments from a video model steer it to produce coherent four-minute clips without long-video teachers or retraining.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By drawing guidance segments directly from the student model's own long self-generated videos, the method supplies reliable teacher-like signals that prevent error accumulation in the continuous latent space, thereby enabling high-fidelity video generation up to 4 minutes 15 seconds long without any long-horizon teacher or additional training data.
What carries the argument
The self-forcing++ loop that repeatedly samples short segments from the model's own extended generations and feeds them back as conditioning to maintain consistency across the full sequence.
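To make the loop concrete, here is a minimal sketch of how such self-sampling guidance could look in practice, assuming a student that generates latent chunks autoregressively and a teacher-derived loss defined on short windows; all names (generate_chunk, teacher_guidance_loss, WINDOW, LONG_LEN) are illustrative assumptions rather than the authors' actual interface.

```python
# Minimal sketch of the self-sampling guidance loop described above.
# All names (generate_chunk, teacher_guidance_loss, WINDOW, LONG_LEN) are
# illustrative assumptions, not the authors' actual interface.
import random
import torch

WINDOW = 48     # short segment length the teacher can reliably judge (assumed)
LONG_LEN = 960  # latent frames in one long self-generated rollout (assumed)

def self_guidance_step(student, teacher_guidance_loss, prompt, optimizer):
    # 1) Roll the student out far beyond the teacher's horizon, conditioning
    #    each chunk on its own previous chunks (no overlapping-frame recomputation).
    with torch.no_grad():
        context, chunks = None, []
        while sum(c.shape[1] for c in chunks) < LONG_LEN:
            chunk, context = student.generate_chunk(prompt, context)
            chunks.append(chunk)
        video = torch.cat(chunks, dim=1)  # (batch, frames, ...)

    # 2) Sample a short segment from the long self-generated video.
    start = random.randint(0, video.shape[1] - WINDOW)
    segment = video[:, start:start + WINDOW]

    # 3) Apply a short-horizon, teacher-derived loss on that segment only
    #    (assumed to re-run the student with gradients enabled), so guidance
    #    lands at positions the student actually visits during long rollouts.
    loss = teacher_guidance_loss(student, segment, prompt)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The design point the sketch reflects is that the teacher is only ever asked to judge short windows it can handle, while those windows are drawn from positions deep inside rollouts the student will actually visit.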
If this is right
- Temporal consistency is preserved while video length scales up to twenty times beyond the teacher's horizon.
- Over-exposure and error accumulation are avoided without recomputing overlapping frames.
- Generation reaches 99.9 percent of the base model's position-embedding span.
- Fidelity and consistency scores exceed those of prior baselines on both standard and newly proposed benchmarks.
- No retraining on long-video datasets is needed.
Where Pith is reading between the lines
- The same self-sampling idea could be tested on other autoregressive modalities such as long audio or 3-D scene sequences.
- If self-guidance works here, it may reduce reliance on ever-larger supervised datasets for extending context in generative models.
- Position-embedding limits rather than error accumulation may become the next practical bottleneck once this technique is applied.
Load-bearing premise
Segments taken from the model's own long outputs remain of high enough quality to act as stable guidance without injecting new compounding errors.
What would settle it
Run the method with a five-minute target and measure whether visual fidelity or temporal consistency drops measurably below the quality achieved at four minutes and fifteen seconds.
read the original abstract
Diffusion models have revolutionized image and video generation, achieving unprecedented visual quality. However, their reliance on transformer architectures incurs prohibitively high computational costs, particularly when extending generation to long videos. Recent work has explored autoregressive formulations for long video generation, typically by distilling from short-horizon bidirectional teachers. Nevertheless, given that teacher models cannot synthesize long videos, the extrapolation of student models beyond their training horizon often leads to pronounced quality degradation, arising from the compounding of errors within the continuous latent space. In this paper, we propose a simple yet effective approach to mitigate quality degradation in long-horizon video generation without requiring supervision from long-video teachers or retraining on long video datasets. Our approach centers on exploiting the rich knowledge of teacher models to provide guidance for the student model through sampled segments drawn from self-generated long videos. Our method maintains temporal consistency while scaling video length by up to 20x beyond teacher's capability, avoiding common issues such as over-exposure and error-accumulation without recomputing overlapping frames like previous methods. When scaling up the computation, our method shows the capability of generating videos up to 4 minutes and 15 seconds, equivalent to 99.9% of the maximum span supported by our base model's position embedding and more than 50x longer than that of our baseline model. Experiments on standard benchmarks and our proposed improved benchmark demonstrate that our approach substantially outperforms baseline methods in both fidelity and consistency. Our long-horizon videos demo can be found at https://self-forcing-plus-plus.github.io/
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Self-Forcing++, a method for long-horizon autoregressive video generation that uses sampled segments from the student model's own self-generated videos as guidance signals. This is intended to mitigate compounding errors in the continuous latent space without access to long-video teacher models or retraining on extended datasets, enabling videos up to 4 minutes 15 seconds (99.9% of the base model's position-embedding span and >50x longer than the baseline) while preserving temporal consistency and visual fidelity.
Significance. If the central premise holds, the work would meaningfully advance scalable video synthesis by demonstrating that self-guidance can substitute for unavailable teacher supervision at extreme lengths, potentially reducing data and compute barriers for minute-scale generation. The reported length extension and benchmark gains would represent a practical step beyond current short-horizon distillation limits.
major comments (3)
- [Method and Experiments] The central claim that self-generated segments supply reliable, non-degrading guidance rests on an unverified premise: no quantitative characterization of latent-space error growth (e.g., drift in continuous codes) within the initial long roll-outs is provided, nor is there an ablation isolating segment quality or a comparison against oracle teacher segments. This directly affects the soundness of the 20x scaling and 4:15 capability assertions.
- [Abstract and Experiments] The reported improvements in fidelity and consistency are stated without accompanying quantitative metrics (specific FVD, FID, or temporal-consistency scores with error bars and baseline deltas), segment-selection criteria, or ablation studies showing how self-guidance avoids the error accumulation it claims to solve.
- [Results and Scaling Experiments] The 99.9% position-embedding utilization claim is load-bearing for the headline result, yet the manuscript does not detail the mechanism by which self-guidance prevents degradation at this horizon or provide controls (e.g., generation without self-segments) to isolate its contribution.
minor comments (2)
- [Abstract and Experiments] The abstract references 'our proposed improved benchmark' without a clear description of its construction, differences from standard benchmarks, or evaluation protocol in the main text.
- [Method] Notation for segment sampling and guidance injection could be clarified with a concise algorithm box or pseudocode to improve reproducibility.
Simulated Author's Rebuttal
We thank the referee for their thorough review and constructive comments. We address each of the major comments below and have updated the manuscript to incorporate additional analyses and clarifications as suggested.
read point-by-point responses
-
Referee: [Method and Experiments] The central claim that self-generated segments supply reliable, non-degrading guidance rests on an unverified premise: no quantitative characterization of latent-space error growth (e.g., drift in continuous codes) within the initial long roll-outs is provided, nor is there an ablation isolating segment quality or a comparison against oracle teacher segments. This directly affects the soundness of the 20x scaling and 4:15 capability assertions.
Authors: We agree that a quantitative characterization of latent-space error growth would strengthen the central claim. In the revised manuscript we have added measurements of drift in continuous latent codes across initial long roll-outs. We also include a new ablation that isolates segment quality and compares self-generated segments against oracle teacher segments (where short-horizon teacher outputs can be obtained). These additions directly support the soundness of the reported 20x scaling and 4:15 capability. revision: yes
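As one concrete way to read this response, the following is a hedged sketch of how drift in continuous latent codes might be quantified across a long roll-out; the metric (per-chunk deviation of latent mean and standard deviation from the first chunk) is an assumption for illustration, not the paper's definition.

```python
# Hedged sketch of quantifying latent-space drift across a long roll-out:
# track how far each chunk's latent statistics move from those of the first
# chunk. The metric choice is an assumption, not the paper's definition.
import torch

def latent_drift(latents_per_chunk):
    """latents_per_chunk: list of tensors, each shaped (batch, frames, C, H, W)."""
    ref_mean = latents_per_chunk[0].mean(dim=(1, 2, 3, 4))
    ref_std = latents_per_chunk[0].std(dim=(1, 2, 3, 4))
    drift = []
    for chunk in latents_per_chunk:
        d_mean = (chunk.mean(dim=(1, 2, 3, 4)) - ref_mean).abs()
        d_std = (chunk.std(dim=(1, 2, 3, 4)) - ref_std).abs()
        drift.append((d_mean + d_std).mean().item())
    return drift  # one value per chunk; a rising curve signals compounding error
```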
-
Referee: [Abstract and Experiments] The reported improvements in fidelity and consistency are stated without accompanying quantitative metrics (specific FVD, FID, or temporal-consistency scores with error bars and baseline deltas), segment-selection criteria, or ablation studies showing how self-guidance avoids the error accumulation it claims to solve.
Authors: We acknowledge that explicit quantitative metrics, segment-selection criteria, and targeted ablations were insufficiently detailed. The revised manuscript now reports specific FVD, FID, and temporal consistency scores with error bars and baseline deltas. We have also added the segment-selection criteria and ablation studies that isolate how self-guidance mitigates error accumulation. These updates appear in both the abstract and the results sections. revision: yes
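For illustration, a minimal sketch of reporting one such metric with error bars and a baseline delta is given below; the temporal-consistency proxy (adjacent-frame feature cosine similarity) and the 95% confidence interval are assumptions, not the paper's protocol.

```python
# Minimal sketch of reporting a temporal-consistency score with error bars and
# a baseline delta. The consistency proxy (adjacent-frame feature cosine
# similarity) and the 95% interval are assumptions, not the paper's protocol.
import torch
import torch.nn.functional as F

def temporal_consistency(frame_features):
    """frame_features: (num_frames, dim) features of one video's frames."""
    sims = F.cosine_similarity(frame_features[:-1], frame_features[1:], dim=-1)
    return sims.mean().item()

def report(scores_ours, scores_baseline):
    ours = torch.tensor(scores_ours)
    base = torch.tensor(scores_baseline)
    mean = ours.mean().item()
    err = 1.96 * ours.std().item() / len(ours) ** 0.5  # 95% CI half-width
    delta = mean - base.mean().item()
    print(f"temporal consistency: {mean:.3f} +/- {err:.3f} "
          f"(delta vs. baseline: {delta:+.3f})")
```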
-
Referee: [Results and Scaling Experiments] The 99.9% position-embedding utilization claim is load-bearing for the headline result, yet the manuscript does not detail the mechanism by which self-guidance prevents degradation at this horizon or provide controls (e.g., generation without self-segments) to isolate its contribution.
Authors: We agree that the mechanism and isolating controls should be made explicit. The revised version explains that self-guidance periodically resets latent-space drift by injecting high-fidelity self-generated segments. We have added control experiments that generate long videos without self-segments; these exhibit rapid degradation, thereby isolating the contribution of self-guidance at the 99.9% position-embedding horizon. revision: yes
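A hypothetical sketch of such a control is shown below: roll out two students, one trained with self-segment guidance and one without, to the same horizon and compare their drift curves, reusing the latent_drift helper sketched earlier; the student interface is assumed, not the authors' actual API.

```python
# Hypothetical control comparison isolating the contribution of self-guidance:
# roll out two students (trained with vs. without self-segment guidance) to the
# same horizon and compare drift curves. Reuses the latent_drift helper from
# the earlier sketch; the student interface is assumed, not the authors' API.
import torch

def compare_controls(student_with, student_without, prompt, horizon_chunks):
    curves = {}
    for name, student in [("with_self_guidance", student_with),
                          ("without_self_guidance", student_without)]:
        with torch.no_grad():
            context, chunks = None, []
            for _ in range(horizon_chunks):
                chunk, context = student.generate_chunk(prompt, context)
                chunks.append(chunk)
        curves[name] = latent_drift(chunks)  # helper defined in the earlier sketch
    return curves  # rapid growth in the "without" curve isolates self-guidance
```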
Circularity Check
No significant circularity; method is an empirical technique validated experimentally
full rationale
The paper proposes using sampled segments from self-generated long videos as guidance for the student model to extend generation length without long-video teacher data. This is framed as a practical mitigation for error compounding in autoregressive extrapolation, with performance claims (scaling to 99.9% of the position-embedding span and 50x the baseline) resting on experimental benchmarks rather than any derivation. No equations, parameter fits, or self-citations are shown that reduce the central result to the inputs by construction. The premise that self-samples remain of sufficiently high quality is an empirical assumption subject to external validation, not a self-definitional or fitted-input equivalence. The central result therefore rests on the stated experimental evidence rather than on any circular derivation.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: self-generated segments from the student model remain of sufficiently high quality to provide effective guidance equivalent to that of a bidirectional teacher.
Forward citations
Cited by 21 Pith papers
-
CausalCine: Real-Time Autoregressive Generation for Multi-Shot Video Narratives
CausalCine enables real-time causal autoregressive multi-shot video generation via multi-shot training, content-aware memory routing for coherence, and distillation to few-step inference.
-
HorizonDrive: Self-Corrective Autoregressive World Model for Long-horizon Driving Simulation
HorizonDrive enables stable long-horizon autoregressive driving simulation via anti-drifting teacher training with scheduled rollout recovery and teacher rollout distillation.
-
FreeSpec: Training-Free Long Video Generation via Singular-Spectrum Reconstruction
FreeSpec uses SVD-based spectral reconstruction to fuse global low-rank and local high-rank features, reducing content drift and preserving temporal dynamics in long video generation.
-
Stream-R1: Reliability-Perplexity Aware Reward Distillation for Streaming Video Generation
Stream-R1 improves distillation of autoregressive streaming video diffusion models by adaptively weighting supervision with a reward model at both rollout and per-pixel levels.
-
AsymTalker: Identity-Consistent Long-Term Talking Head Generation via Asymmetric Distillation
AsymTalker maintains identity consistency in long-term diffusion talking-head videos by encoding temporal references from a static image and training a student model under inference-like conditions via asymmetric dist...
-
Efficient Video Diffusion Models: Advancements and Challenges
A survey that groups efficient video diffusion methods into four paradigms—step distillation, efficient attention, model compression, and cache/trajectory optimization—and outlines open challenges for practical use.
-
Grounded Forcing: Bridging Time-Independent Semantics and Proximal Dynamics in Autoregressive Video Synthesis
Grounded Forcing introduces dual memory caching, reference-based positional embeddings, and proximity-weighted recaching to bridge stable semantics with local dynamics, improving long-range consistency in autoregressi...
-
Head Forcing: Long Autoregressive Video Generation via Head Heterogeneity
Head Forcing assigns tailored KV cache strategies to local, anchor, and memory attention heads plus head-wise RoPE re-encoding to extend autoregressive video generation from seconds to minutes without training.
-
Delta Forcing: Trust Region Steering for Interactive Autoregressive Video Generation
Delta Forcing uses latent trajectory deltas to adaptively limit unreliable teacher guidance while enforcing monotonic continuity, improving temporal consistency in interactive autoregressive video generation.
-
AsymTalker: Identity-Consistent Long-Term Talking Head Generation via Asymmetric Distillation
AsymK-Talker introduces kernel-conditioned loop generation, temporal reference encoding, and asymmetric kernel distillation to achieve real-time, drift-resistant talking head synthesis from audio using diffusion models.
-
AsymTalker: Identity-Consistent Long-Term Talking Head Generation via Asymmetric Distillation
AsymTalker uses temporal reference encoding and asymmetric knowledge distillation to produce identity-consistent talking head videos up to 600 seconds long at 66 FPS.
-
Mutual Forcing: Dual-Mode Self-Evolution for Fast Autoregressive Audio-Video Character Generation
Mutual Forcing trains a single native autoregressive audio-video model with mutually reinforcing few-step and multi-step modes via self-distillation to match 50-step baselines at 4-8 steps.
-
Memorize When Needed: Decoupled Memory Control for Spatially Consistent Long-Horizon Video Generation
A decoupled memory branch with hybrid cues, cross-attention, and gating improves spatial consistency and data efficiency in long-horizon camera-trajectory video generation.
-
Repurposing 3D Generative Model for Autoregressive Layout Generation
LaviGen turns 3D generative models into an autoregressive layout generator that models geometric and physical constraints, delivering 19% higher physical plausibility and 65% faster inference on the LayoutVLM benchmark.
-
From Synchrony to Sequence: Exo-to-Ego Generation via Interpolation
Interpolating exo and ego videos into a single continuous sequence lets diffusion sequence models generate more coherent first-person videos than direct conditioning, even without pose interpolation.
-
Long-Horizon Streaming Video Generation via Hybrid Attention with Decoupled Distillation
Hybrid Forcing combines linear temporal attention for long-range retention, block-sparse attention for efficiency, and decoupled distillation to achieve real-time unbounded 832x480 streaming video generation at 29.5 FPS.
-
WorldPlay: Towards Long-Term Geometric Consistency for Real-Time Interactive World Modeling
WorldPlay uses dual action representation, reconstituted context memory, and context forcing distillation to produce consistent 720p streaming video at 24 FPS for interactive world modeling.
-
TurboTalk: Progressive Distillation for One-Step Audio-Driven Talking Avatar Generation
TurboTalk uses progressive distillation from 4 steps to 1 step with distribution matching and adversarial training to achieve 120x faster single-step audio-driven talking avatar video generation.
-
Matrix-Game 3.0: Real-Time and Streaming Interactive World Model with Long-Horizon Memory
Matrix-Game 3.0 delivers 720p real-time video generation at 40 FPS with minute-scale memory consistency by combining residual self-correction training, camera-aware memory injection, and DMD-based autoregressive disti...
-
EchoTorrent: Towards Swift, Sustained, and Streaming Multi-Modal Video Generation
EchoTorrent combines multi-teacher distillation, adaptive CFG calibration, hybrid long-tail forcing, and VAE decoder refinement to enable few-pass autoregressive streaming video generation with improved temporal consi...
-
Evolution of Video Generative Foundations
This survey traces video generation technology from GANs to diffusion models and then to autoregressive and multimodal approaches while analyzing principles, strengths, and future trends.
discussion (0)