pith. machine review for the scientific record.

arxiv: 2512.04678 · v2 · submitted 2025-12-04 · 💻 cs.CV

Recognition: 3 theorem links

Reward Forcing: Efficient Streaming Video Generation with Rewarded Distribution Matching Distillation

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 18:12 UTC · model grok-4.3

classification 💻 cs.CV
keywords streaming video generation · video diffusion models · knowledge distillation · exponential moving average · motion dynamics · reward model · real-time inference

The pith

EMA-updated sink tokens and reward-weighted distillation fix copied frames and weak motion in streaming video models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Existing few-step video diffusion models use sliding-window attention with the first frames held as fixed sink tokens. This design keeps long-range attention stable but causes later frames to duplicate the opening content and lose motion. Reward Forcing replaces the static sinks with EMA-Sink tokens that are continuously refreshed by folding evicted frames into the memory via exponential moving average, preserving both distant context and recent change at zero added cost. The same framework adds Rewarded Distribution Matching Distillation, which scores every training clip for motion strength with a vision-language model and up-weights high-scoring examples so the student model learns lively dynamics rather than averaging across all data equally.
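To make the EMA-Sink idea concrete, here is a minimal numpy sketch of the update described above: tokens evicted from the sliding window are folded into a fixed-size sink memory with an exponential moving average. The function names, shapes, pooling choice, and decay value are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def ema_sink_update(sink_tokens, evicted_tokens, decay=0.95):
    """Fold tokens that just left the sliding attention window into a
    fixed-size sink memory via an exponential moving average.

    sink_tokens:    (num_sink, dim) current sink memory, initialized
                    from the first frames' tokens.
    evicted_tokens: (num_sink, dim) evicted tokens, pooled or projected
                    to the same fixed size.
    decay:          EMA decay factor (a free parameter per the ledger).
    """
    return decay * sink_tokens + (1.0 - decay) * evicted_tokens

# Toy streaming loop: the sink keeps a slowly refreshed summary of history
# while the sliding window holds only the most recent frame tokens.
dim, num_sink, window = 8, 4, 6
rng = np.random.default_rng(0)

sink = rng.normal(size=(num_sink, dim))            # initialized from first frames
window_tokens = list(rng.normal(size=(window, dim)))

for step in range(20):
    window_tokens.append(rng.normal(size=(dim,)))  # newest frame token arrives
    evicted = window_tokens.pop(0)                 # oldest token exits the window
    # one possible fusion: broadcast the evicted token across all sink slots
    sink = ema_sink_update(sink, np.tile(evicted, (num_sink, 1)))
    # the attention context at this step would be [sink ; window_tokens]
```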

Core claim

Reward Forcing claims that replacing fixed initial-frame sink tokens with exponential moving-average updates, combined with reweighting the distribution-matching loss toward high-dynamics samples, eliminates frame copying, restores motion richness, and delivers state-of-the-art benchmark results alongside streaming generation at 23.1 frames per second on a single H100 GPU.

What carries the argument

EMA-Sink, a fixed-size set of tokens initialized from the first frames and updated by fusing tokens that exit the sliding attention window via an exponential moving average, paired with Re-DMD, which reweights the distillation objective according to per-sample motion rewards from a vision-language model.
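A hedged sketch of the reweighting half of that machinery: per-sample distillation losses are averaged with weights derived from the VLM motion rewards rather than uniformly. The softmax form, the `strength` parameter, and the toy numbers are assumptions for illustration; the paper's Re-DMD objective may combine the reward differently.

```python
import numpy as np

def re_dmd_weights(rewards, strength=0.5):
    """Map per-sample motion rewards (e.g. VLM dynamics ratings) to
    normalized sample weights; `strength` plays the role of the reward
    weighting hyperparameter listed in the ledger."""
    logits = strength * np.asarray(rewards, dtype=float)
    w = np.exp(logits - logits.max())
    return w / w.sum()

def rewarded_dmd_loss(per_sample_dmd_loss, rewards, strength=0.5):
    """Reweight a batch of distribution-matching losses toward
    high-dynamics samples instead of treating every sample equally."""
    w = re_dmd_weights(rewards, strength)
    return float(np.sum(w * np.asarray(per_sample_dmd_loss, dtype=float)))

# Toy batch: two lively clips and two near-static clips.
losses  = [1.2, 0.9, 1.1, 1.0]   # per-sample DMD losses from the student
rewards = [4.0, 5.0, 1.0, 0.5]   # VLM motion scores (higher = more dynamic)
print(rewarded_dmd_loss(losses, rewards))   # dynamic clips dominate the objective
print(float(np.mean(losses)))               # vanilla DMD: uniform average
```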

If this is right

  • Streaming video generation reaches interactive speeds on single-GPU hardware while preserving long-horizon consistency.
  • Distilled models can emphasize dynamic content during training without losing fidelity to the teacher distribution.
  • Standard video benchmarks record state-of-the-art quality under the new EMA-Sink and Re-DMD regime.
  • High-quality real-time video becomes practical for interactive world simulation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same EMA memory update could replace static context tokens in other long-sequence generators such as audio or 3-D scene models.
  • If the vision-language reward generalizes, the forcing technique may improve motion quality in non-streaming text-to-video systems.
  • Real-time performance enables closed-loop setups where generated video is fed back as new conditioning.
  • Selective emphasis on high-variance samples during distillation offers a general route to increasing output diversity.

Load-bearing premise

The vision-language model that scores motion dynamics supplies ratings that genuinely identify better training examples without introducing new artifacts or shifting the output distribution in unintended ways.

What would settle it

Train the model with Reward Forcing and measure both motion-magnitude metrics and the fraction of frames that closely match the initial frame; if motion scores fail to rise above the baseline or initial-frame duplication remains high, the claimed improvements do not hold.
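A crude, self-contained version of that measurement, assuming frames as arrays in [0, 1]; the duplication threshold and the difference-based motion proxy are stand-ins for whatever optical-flow or benchmark metrics the paper actually reports.

```python
import numpy as np

def frame_copy_fraction(frames, threshold=0.02):
    """Fraction of later frames that nearly duplicate the first frame,
    measured by mean absolute pixel difference."""
    first = frames[0]
    diffs = [np.abs(f - first).mean() for f in frames[1:]]
    return float(np.mean([d < threshold for d in diffs]))

def motion_magnitude(frames):
    """Average frame-to-frame change, a cheap proxy for motion strength."""
    return float(np.mean([np.abs(frames[i + 1] - frames[i]).mean()
                          for i in range(len(frames) - 1)]))

# Toy check: a frozen clip vs. a slowly drifting one.
static = [np.full((16, 16), 0.5) for _ in range(10)]
moving = [np.full((16, 16), 0.5) + 0.03 * t for t in range(10)]
print(frame_copy_fraction(static), motion_magnitude(static))   # ~1.0 and ~0.0
print(frame_copy_fraction(moving), motion_magnitude(moving))   # ~0.0 and ~0.03
```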

read the original abstract

Efficient streaming video generation is critical for simulating interactive and dynamic worlds. Existing methods distill few-step video diffusion models with sliding window attention, using initial frames as sink tokens to maintain attention performance and reduce error accumulation. However, video frames become overly dependent on these static tokens, resulting in copied initial frames and diminished motion dynamics. To address this, we introduce Reward Forcing, a novel framework with two key designs. First, we propose EMA-Sink, which maintains fixed-size tokens initialized from initial frames and continuously updated by fusing evicted tokens via exponential moving average as they exit the sliding window. Without additional computation cost, EMA-Sink tokens capture both long-term context and recent dynamics, preventing initial frame copying while maintaining long-horizon consistency. Second, to better distill motion dynamics from teacher models, we propose a novel Rewarded Distribution Matching Distillation (Re-DMD). Vanilla distribution matching treats every training sample equally, limiting the model's ability to prioritize dynamic content. Instead, Re-DMD biases the model's output distribution toward high-reward regions by prioritizing samples with greater dynamics rated by a vision-language model. Re-DMD significantly enhances motion quality while preserving data fidelity. We include both quantitative and qualitative experiments to show that Reward Forcing achieves state-of-the-art performance on standard benchmarks while enabling high-quality streaming video generation at 23.1 FPS on a single H100 GPU.

Editorial analysis

A structured set of objections, weighed in public.

A referee report, a simulated authors' rebuttal, a circularity audit, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that Reward Forcing enables efficient streaming video generation by combining EMA-Sink (exponential moving average updates to fixed-size sink tokens in sliding-window attention to preserve long-term context and recent dynamics without extra cost) with Rewarded Distribution Matching Distillation (Re-DMD), which biases distillation toward high-dynamics samples scored by a vision-language model. This addresses copied initial frames and reduced motion in prior distillation methods, yielding SOTA quantitative/qualitative results on standard benchmarks and 23.1 FPS inference on a single H100 GPU.

Significance. If the VLM reward signal proves reliable and unbiased, the work would meaningfully advance practical streaming video diffusion by mitigating error accumulation and motion degradation in long-horizon generation, with direct relevance to interactive simulation and real-time applications. The reported FPS and benchmark gains, if reproducible, represent a concrete efficiency-quality tradeoff improvement over existing sliding-window distillation baselines.

major comments (2)
  1. [Experiments] Experiments section: The central claim that Re-DMD improves motion dynamics rests on VLM-rated sample prioritization, yet no quantitative validation is provided (e.g., Pearson correlation between VLM scores and optical-flow magnitude, or human preference studies comparing Re-DMD vs. uniform DMD). Without such checks, it remains possible that the VLM introduces systematic bias or artifacts that degrade temporal consistency precisely in the streaming regime.
  2. [§3.2] §3.2 (Re-DMD description): The reward weighting strength is listed as a free hyperparameter, yet the abstract and method claim that Re-DMD 'significantly enhances motion quality while preserving data fidelity' without reporting sensitivity analysis or ablation on this parameter; if performance collapses outside a narrow range, the method's robustness is overstated.
minor comments (2)
  1. [Abstract] The abstract states 'without additional computation cost' for EMA-Sink, but the EMA fusion of evicted tokens is an extra operation governed by a tunable decay factor; its training-time overhead, however small, should be quantified.
  2. [Tables] Table captions and metric definitions (e.g., exact formulation of the dynamics reward) would benefit from explicit equations to allow direct replication.
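On the first major comment, the kind of check the referee asks for is cheap to run once per-clip VLM scores and flow magnitudes are in hand. A minimal sketch follows; the numbers are hypothetical placeholders, and real values would come from the VLM and an optical-flow estimator over a held-out validation set.

```python
import numpy as np

def reward_signal_check(vlm_scores, flow_magnitudes):
    """Sanity-check the VLM motion reward: Pearson correlation against
    per-clip mean optical-flow magnitude, plus top-quartile rank overlap.
    Both inputs hold one value per validation clip."""
    s = np.asarray(vlm_scores, dtype=float)
    f = np.asarray(flow_magnitudes, dtype=float)
    pearson_r = float(np.corrcoef(s, f)[0, 1])
    k = max(1, len(s) // 4)
    top_overlap = len(set(np.argsort(-s)[:k]) & set(np.argsort(-f)[:k])) / k
    return pearson_r, top_overlap

# Hypothetical per-clip numbers, purely to show the call shape.
print(reward_signal_check([4.2, 1.1, 3.5, 0.8, 2.9, 4.8],
                          [6.0, 1.4, 4.9, 1.1, 3.8, 7.2]))
```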

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major comment below and outline the revisions we will make to strengthen the manuscript.

read point-by-point responses
  1. Referee: [Experiments] Experiments section: The central claim that Re-DMD improves motion dynamics rests on VLM-rated sample prioritization, yet no quantitative validation is provided (e.g., Pearson correlation between VLM scores and optical-flow magnitude, or human preference studies comparing Re-DMD vs. uniform DMD). Without such checks, it remains possible that the VLM introduces systematic bias or artifacts that degrade temporal consistency precisely in the streaming regime.

    Authors: We agree that additional quantitative validation would bolster the claims regarding the VLM reward signal. In the revised version, we will add an analysis computing the Pearson correlation between VLM-assigned scores and optical flow magnitudes across a validation set of video samples. Additionally, we will include results from a human preference study involving 50 participants comparing videos generated with Re-DMD against those from uniform DMD, focusing on motion quality and temporal consistency. These additions will help confirm the reliability of the VLM signal and address potential bias concerns. revision: yes

  2. Referee: [§3.2] §3.2 (Re-DMD description): The reward weighting strength is listed as a free hyperparameter, yet the abstract and method claim that Re-DMD 'significantly enhances motion quality while preserving data fidelity' without reporting sensitivity analysis or ablation on this parameter; if performance collapses outside a narrow range, the method's robustness is overstated.

    Authors: The referee is correct that the reward weighting strength λ is a hyperparameter. We chose λ=0.5 based on initial tuning to achieve the reported balance. To demonstrate robustness, the revised manuscript will include an ablation study varying λ from 0.1 to 1.0, reporting metrics such as FVD, motion magnitude, and data fidelity measures. This will show that performance remains stable within a reasonable range around the selected value, supporting the claims without overstatement. revision: yes

Circularity Check

0 steps flagged

No significant circularity; novel components are independent of inputs

full rationale

The paper's core contributions—EMA-Sink token management via exponential moving average fusion of evicted tokens and Re-DMD which reweights distillation samples by external VLM-rated dynamics—are presented as new designs that extend standard sliding-window distillation without reducing to self-defined parameters or fitted inputs. No equations equate a claimed prediction or output distribution to an internal fit or prior self-citation by construction. The VLM reward signal is treated as an external oracle rather than derived from the model's own outputs, and performance claims rest on empirical benchmarks rather than tautological renaming or uniqueness theorems imported from the same authors. Any self-citations to prior distillation work are non-load-bearing and do not substitute for the novel mechanisms.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The framework rests on the assumption that a VLM can serve as a reliable motion-quality oracle and that EMA fusion preserves both consistency and dynamics without new failure modes; no new physical entities are introduced.

free parameters (2)
  • EMA decay factor
    Controls how quickly evicted tokens influence the sink tokens; must be chosen or tuned.
  • Reward weighting strength
    Determines how strongly high-dynamics samples are prioritized during distillation.
axioms (1)
  • domain assumption: Vision-language models can reliably quantify motion dynamics in generated video frames
    Invoked to create the reward signal for Re-DMD.

pith-pipeline@v0.9.0 · 5585 in / 1273 out tokens · 27927 ms · 2026-05-16T18:12:24.693015+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • Foundation.DAlembert.Inevitability bilinear_family_forced · unclear

    Relation between the paper passage and the cited Recognition theorem.

    Re-DMD biases the model’s output distribution toward high-reward regions by prioritizing samples with greater dynamics rated by a vision-language model.

  • Foundation.DimensionForcing dimension_forced · unclear

    Relation between the paper passage and the cited Recognition theorem.

    Reward Forcing achieves state-of-the-art performance on standard benchmarks while enabling high-quality streaming video generation at 23.1 FPS on a single H100 GPU.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 17 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. KVPO: ODE-Native GRPO for Autoregressive Video Alignment via KV Semantic Exploration

    cs.CV 2026-05 unverdicted novelty 7.0

    KVPO aligns streaming autoregressive video generators with human preferences via ODE-native GRPO, using KV cache for semantic exploration and TVE for velocity-based policy modeling, yielding gains in quality and alignment.

  2. DMGD: Train-Free Dataset Distillation with Semantic-Distribution Matching in Diffusion Models

    cs.CV 2026-05 unverdicted novelty 7.0

    DMGD achieves better performance than fine-tuned SOTA methods in dataset distillation on ImageNet subsets by using semantic matching through conditional likelihood optimization and OT-based distribution matching in a ...

  3. Stream-R1: Reliability-Perplexity Aware Reward Distillation for Streaming Video Generation

    cs.CV 2026-05 unverdicted novelty 7.0

    Stream-R1 improves distillation of autoregressive streaming video diffusion models by adaptively weighting supervision with a reward model at both rollout and per-pixel levels.

  4. Hallo-Live: Real-Time Streaming Joint Audio-Video Avatar Generation with Asynchronous Dual-Stream and Human-Centric Preference Distillation

    cs.CV 2026-04 unverdicted novelty 7.0

    Hallo-Live achieves 20.38 FPS real-time text-to-audio-video avatar generation with 0.94s latency using asynchronous dual-stream diffusion and HP-DMD preference distillation, matching teacher model quality at 16x highe...

  5. Efficient Video Diffusion Models: Advancements and Challenges

    cs.CV 2026-04 unverdicted novelty 7.0

    A survey that groups efficient video diffusion methods into four paradigms—step distillation, efficient attention, model compression, and cache/trajectory optimization—and outlines open challenges for practical use.

  6. Attention Sink in Transformers: A Survey on Utilization, Interpretation, and Mitigation

    cs.LG 2026-04 unverdicted novelty 7.0

    The first survey on Attention Sink in Transformers structures the literature around fundamental utilization, mechanistic interpretation, and strategic mitigation.

  7. Grounded Forcing: Bridging Time-Independent Semantics and Proximal Dynamics in Autoregressive Video Synthesis

    cs.CV 2026-04 unverdicted novelty 7.0

    Grounded Forcing introduces dual memory caching, reference-based positional embeddings, and proximity-weighted recaching to bridge stable semantics with local dynamics, improving long-range consistency in autoregressi...

  8. Head Forcing: Long Autoregressive Video Generation via Head Heterogeneity

    cs.CV 2026-05 unverdicted novelty 6.0

    Head Forcing assigns tailored KV cache strategies to local, anchor, and memory attention heads plus head-wise RoPE re-encoding to extend autoregressive video generation from seconds to minutes without training.

  9. PhyMotion: Structured 3D Motion Reward for Physics-Grounded Human Video Generation

    cs.CV 2026-05 conditional novelty 6.0

    PhyMotion scores generated human videos by grounding recovered 3D poses in a physics simulator across kinematic, contact, and dynamic axes, yielding stronger human correlation and larger RL post-training gains than pr...

  10. Stream-T1: Test-Time Scaling for Streaming Video Generation

    cs.CV 2026-05 unverdicted novelty 6.0

    Stream-T1 is a test-time scaling framework for streaming video generation using scaled noise propagation from history, reward pruning across short and long windows, and feedback-guided memory sinking to improve tempor...

  11. Human Cognition in Machines: A Unified Perspective of World Models

    cs.RO 2026-04 unverdicted novelty 6.0

    The paper introduces a unified framework for world models that fully incorporates all cognitive functions from Cognitive Architecture Theory, highlights under-researched areas in motivation and meta-cognition, and pro...

  12. Long-Horizon Streaming Video Generation via Hybrid Attention with Decoupled Distillation

    cs.CV 2026-04 conditional novelty 6.0

    Hybrid Forcing combines linear temporal attention for long-range retention, block-sparse attention for efficiency, and decoupled distillation to achieve real-time unbounded 832x480 streaming video generation at 29.5 FPS.

  13. Salt: Self-Consistent Distribution Matching with Cache-Aware Training for Fast Video Generation

    cs.CV 2026-04 unverdicted novelty 6.0

    Salt improves low-step video generation quality by adding endpoint-consistent regularization to distribution matching distillation and using cache-conditioned feature alignment for autoregressive models.

  14. Video Generation Models as World Models: Efficient Paradigms, Architectures and Algorithms

    eess.IV 2026-03 unverdicted novelty 6.0

    Video generation models can function as world simulators if efficiency gaps in spatiotemporal modeling are bridged via organized paradigms, architectures, and algorithms.

  15. Rolling Sink: Bridging Limited-Horizon Training and Open-Ended Testing in Autoregressive Video Diffusion

    cs.CV 2026-02 unverdicted novelty 6.0

    Rolling Sink is a training-free cache adjustment technique that maintains visual consistency in autoregressive video diffusion models for ultra-long open-ended generation beyond training horizons.

  16. EchoTorrent: Towards Swift, Sustained, and Streaming Multi-Modal Video Generation

    cs.CV 2026-02 unverdicted novelty 4.0

    EchoTorrent combines multi-teacher distillation, adaptive CFG calibration, hybrid long-tail forcing, and VAE decoder refinement to enable few-pass autoregressive streaming video generation with improved temporal consi...

  17. Advancing Open-source World Models

    cs.CV 2026-01 unverdicted novelty 4.0

    LingBot-World is presented as an open-source world model that delivers high-fidelity simulation, minute-level contextual consistency, and real-time interactivity under one second latency.

Reference graph

Works this paper leans on

99 extracted references · 99 canonical work pages · cited by 17 Pith papers · 31 internal anchors
