
arxiv: 2605.15190 · v1 · pith:OSSELZHAnew · submitted 2026-05-14 · 💻 cs.CV

RAVEN: Real-time Autoregressive Video Extrapolation with Consistency-model GRPO

Pith reviewed 2026-06-30 20:35 UTC · model grok-4.3

classification 💻 cs.CV
keywords autoregressive video generation · causal video diffusion · consistency models · reinforcement learning · real-time extrapolation · self-rollout repacking · distribution alignment

The pith

RAVEN aligns training attention with inference-time extrapolation in causal video diffusion by repacking self-rollouts into interleaved clean and noisy sequences.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that the persistent gap between training history distributions and inference-time histories limits quality in causal autoregressive video diffusion models over long horizons. RAVEN closes this gap by repacking each self-rollout into an interleaved sequence of clean historical endpoints and noisy denoising states, allowing chunk losses to directly supervise the history representations used for future predictions. This alignment supports real-time streaming generation where future chunks are extrapolated from previously generated content. The work also introduces CM-GRPO, which casts consistency sampling as a conditional Gaussian transition to enable direct online reinforcement learning on the kernel. Experiments show that RAVEN outperforms recent causal video distillation baselines on quality, semantic, and dynamic degree metrics, with additional gains when combined with CM-GRPO.

Core claim

RAVEN is a training-time test framework that repacks each self rollout into an interleaved sequence of clean historical endpoints and noisy denoising states. This formulation aligns training attention with inference-time extrapolation and allows downstream chunk losses to supervise the history representations on which future predictions depend. CM-GRPO reformulates a consistency sampling step as a conditional Gaussian transition and applies online RL directly to this kernel.
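
To make the CM-GRPO idea concrete, here is a minimal sketch of how a consistency sampling step can be read as a conditional Gaussian transition with a tractable log-probability, which a GRPO-style objective can then score. It is an illustration under stated assumptions, not the paper's implementation: the linear noise schedule `x_t = (1 - t) * x0 + t * eps`, the function names, and the clip value are all hypothetical, and the group-relative advantage and clipped surrogate follow the generic GRPO recipe.

```python
import numpy as np

def consistency_step(x0_pred, t_next, rng):
    """Re-noise the predicted clean sample to noise level t_next (0 < t_next <= 1).

    Assumption: linear schedule x_t = (1 - t) * x0 + t * eps, so the
    transition is Gaussian with mean (1 - t_next) * x0_pred and std t_next.
    """
    mean = (1.0 - t_next) * x0_pred
    return mean + t_next * rng.standard_normal(x0_pred.shape), mean

def gaussian_log_prob(x_next, mean, std):
    """Log-density of an isotropic Gaussian, summed over all dimensions."""
    z = (x_next - mean) / std
    return float(np.sum(-0.5 * z**2 - np.log(std) - 0.5 * np.log(2.0 * np.pi)))

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize rewards within one prompt's group."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

def clipped_surrogate(logp_new, logp_old, advantages, clip=0.2):
    """Clipped policy-ratio objective (to be maximized), averaged over samples."""
    ratio = np.exp(np.asarray(logp_new, dtype=float) - np.asarray(logp_old, dtype=float))
    adv = np.asarray(advantages, dtype=float)
    clipped = np.clip(ratio, 1.0 - clip, 1.0 + clip)
    return float(np.mean(np.minimum(ratio * adv, clipped * adv)))
```

Because the transition density is available in closed form, the policy ratio can be computed directly from the sampler's own step, which is the property the summary attributes to CM-GRPO when it says no auxiliary Euler-Maruyama process is needed.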

What carries the argument

Repacking of self-rollouts into interleaved sequences of clean historical endpoints and noisy denoising states, which aligns training attention patterns with those at inference.
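
The repacking can be pictured as building one packed sequence plus an attention mask. The sketch below is a hypothetical illustration: the token layout, the function names, and the exact masking rule (clean tokens attend causally to clean tokens; noisy tokens attend to clean tokens of strictly earlier chunks and to their own chunk) are assumptions chosen to mimic KV-cache inference, not details confirmed by the paper.

```python
import numpy as np

def repack_interleaved(clean_chunks, noisy_chunks):
    """Pack chunks as [clean_0, noisy_0, clean_1, noisy_1, ...] along the token axis.

    clean_chunks, noisy_chunks: lists of arrays shaped (tokens, dim), one pair
    per rollout chunk (hypothetical layout). Returns the packed sequence plus
    per-token chunk ids and a per-token clean flag.
    """
    parts, chunk_id, is_clean = [], [], []
    for k, (c, n) in enumerate(zip(clean_chunks, noisy_chunks)):
        for x, flag in ((c, True), (n, False)):
            parts.append(x)
            chunk_id.extend([k] * len(x))
            is_clean.extend([flag] * len(x))
    return np.concatenate(parts), np.array(chunk_id), np.array(is_clean)

def extrapolation_mask(chunk_id, is_clean):
    """mask[q, k] is True when query token q may attend to key token k.

    Assumed rule, mirroring inference with a cache of clean history:
      * clean queries see clean keys of the same or earlier chunks;
      * noisy queries see clean keys of strictly earlier chunks
        plus the noisy tokens of their own chunk.
    """
    q_id, k_id = chunk_id[:, None], chunk_id[None, :]
    q_clean, k_clean = is_clean[:, None], is_clean[None, :]
    clean_to_clean = q_clean & k_clean & (k_id <= q_id)
    noisy_to_history = ~q_clean & k_clean & (k_id < q_id)
    noisy_to_self = ~q_clean & ~k_clean & (k_id == q_id)
    return clean_to_clean | noisy_to_history | noisy_to_self
```

Under this layout, a loss on the noisy tokens of chunk k can back-propagate through the clean tokens of earlier chunks in the same forward pass, which is how chunk losses could supervise the history representations.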

If this is right

  • RAVEN surpasses recent causal video distillation baselines across quality, semantic, and dynamic degree evaluations.
  • CM-GRPO provides further performance gains when combined with RAVEN.
  • The method enables higher-quality real-time streaming video generation by extrapolating future chunks from generated history.
  • Chunk losses can now supervise history representations that future predictions depend on.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The interleaving idea could be tested in autoregressive generation tasks outside video, such as audio or text sequences, to reduce similar training-inference mismatches.
  • The framework might scale to longer video horizons where distribution shift typically grows most severe.
  • Direct RL on the consistency kernel might combine with other sampling accelerations beyond the reported experiments.

Load-bearing premise

Repacking self-rollouts into interleaved clean and noisy sequences aligns training attention with inference-time extrapolation without introducing new distribution shifts or training instabilities that offset the gains.

What would settle it

An ablation that removes or randomizes the interleaving step during training and measures whether the reported gains in long-horizon quality, semantic consistency, and dynamic degree disappear.

Figures

Figures reproduced from arXiv: 2605.15190 by Jiankang Deng, Ronglai Zuo, Yanzuo Lu.

Figure 1: Attention Mask Configuration. Autoregressive video diffusion training paradigms differ in how historical states enter attention and whether those states receive end-to-end supervision from later chunks. Teacher Forcing and Diffusion Forcing rely on data-driven historical states, inducing a training distribution that differs from inference. Self Forcing shifts the history distribution toward inference but r…
Figure 2: Training Pipeline. RAVEN builds on score distillation with a training-time test formulation that aligns the generator’s training context with inference. In the fake-score step, the frozen generator performs autoregressive self rollout with KV cache reuse, producing the clean endpoints and noisy denoising states that are subsequently reused in the generator step. Rather than discarding these rollout states …
Figure 3: Qualitative comparisons. See supplementary for playable video clips.
Figure 4: Qualitative ablation on Training-time Test. See supplementary for playable video clips.
Figure 5: Ablation on Chunk-wise Loss Scaling.
Figure 6: User study preference rates on Quality, Semantic, and Overall. B More Implementation Details Dataset. Both RAVEN and CM-GRPO are trained exclusively on text prompts drawn from VidProM [124], preprocessed through filtering and large language model extension, following the data protocol of Self Forcing [31]. Ablation experiments that require real video data draw from OpenVidHD-0.4M [125], with all video cli…
Figure 7: User study instruction screenshot. D Discussion Although RAVEN and CM-GRPO are presented with concrete design choices tailored to causal autoregressive video distillation, both formulations admit broader scope than the setting evaluated in our experiments. The interleaved sequence construction underlying RAVEN currently treats clean chunks as historical context, yet the supervised forward pass does not res…
Original abstract

Causal autoregressive video diffusion models support real-time streaming generation by extrapolating future chunks from previously generated content. Distilling such generators from high-fidelity bidirectional teachers yields competitive few-step models, yet a persistent gap between the history distributions encountered during training and those arising at inference constrains generation quality over long horizons. We introduce the Real-time Autoregressive Video Extrapolation Network (RAVEN), a training-time test framework that repacks each self rollout into an interleaved sequence of clean historical endpoints and noisy denoising states. This formulation aligns training attention with inference-time extrapolation and allows downstream chunk losses to supervise the history representations on which future predictions depend. We further propose Consistency-model Group Relative Policy Optimization (CM-GRPO), which reformulates a consistency sampling step as a conditional Gaussian transition and applies online Reinforcement Learning (RL) directly to this kernel, avoiding the Euler-Maruyama auxiliary process adopted in prior flow-model RL formulations. Experiments demonstrate that RAVEN surpasses recent causal video distillation baselines across quality, semantic, and dynamic degree evaluations, and that CM-GRPO provides further gains when combined with RAVEN.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims to introduce the Real-time Autoregressive Video Extrapolation Network (RAVEN), a training-time test framework that repacks each self-rollout into an interleaved sequence of clean historical endpoints and noisy denoising states to align training attention with inference-time extrapolation in causal autoregressive video diffusion models. It further proposes Consistency-model Group Relative Policy Optimization (CM-GRPO), which reformulates a consistency sampling step as a conditional Gaussian transition and applies online RL directly to this kernel, avoiding the Euler-Maruyama auxiliary process. Experiments are asserted to show that RAVEN surpasses recent causal video distillation baselines across quality, semantic, and dynamic degree evaluations, with additional gains when combined with CM-GRPO.

Significance. If validated, the work could advance real-time streaming video generation by mitigating history distribution gaps that limit long-horizon quality in autoregressive models. The CM-GRPO reformulation offers a direct RL approach on consistency kernels, which is a technical distinction from prior flow-model RL methods. No machine-checked proofs or reproducible code are mentioned, but the focus on aligning training and inference distributions is a relevant direction if the alignment is rigorously shown.

major comments (2)
  1. [RAVEN description] The RAVEN formulation asserts that repacking self-rollouts into interleaved sequences of clean historical endpoints and noisy denoising states aligns training attention with inference-time extrapolation and enables chunk losses to supervise history representations, but provides no derivation showing that the resulting joint distribution over (clean, noisy) pairs matches inference trajectories. This is load-bearing for the central claim, as unanalyzed distribution shifts from the interleaving could offset the reported gains rather than achieve the intended alignment.
  2. [Experiments] The abstract asserts experimental superiority over causal video distillation baselines on quality, semantic, and dynamic degree evaluations but supplies no metrics, dataset details, ablation studies, or implementation specifics. This absence prevents verification of the magnitude or reliability of the claimed improvements and of whether CM-GRPO provides further gains.
minor comments (1)
  1. The acronym 'CM-GRPO' is expanded as 'Consistency-model Group Relative Policy Optimization' in the text; ensure the title's 'Consistency-model GRPO' is consistent or clarified.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the major comments point by point below.

  1. Referee: [RAVEN description] The RAVEN formulation asserts that repacking self-rollouts into interleaved sequences of clean historical endpoints and noisy denoising states aligns training attention with inference-time extrapolation and enables chunk losses to supervise history representations, but provides no derivation showing that the resulting joint distribution over (clean, noisy) pairs matches inference trajectories. This is load-bearing for the central claim, as unanalyzed distribution shifts from the interleaving could offset the reported gains rather than achieve the intended alignment.

    Authors: We agree with the referee that a formal derivation is absent from the current manuscript. The interleaving strategy is motivated by the need to match the attention patterns, but without an explicit proof that the joint distribution is preserved, the alignment claim remains heuristic. We will add a detailed derivation in the revised version demonstrating that the repacked training distribution matches the inference trajectories under the causal autoregressive setting. revision: yes

  2. Referee: [Experiments] The abstract asserts experimental superiority over causal video distillation baselines on quality, semantic, and dynamic degree evaluations but supplies no metrics, dataset details, ablation studies, or implementation specifics. This absence prevents verification of the magnitude or reliability of the claimed improvements and of whether CM-GRPO provides further gains.

    Authors: The referee is correct that neither the abstract nor the provided manuscript text includes specific metrics, dataset details, or ablations. We will revise the manuscript to include a full Experiments section with quantitative results (e.g., specific scores on quality metrics), details on the datasets used, ablation studies isolating the contributions of RAVEN and CM-GRPO, and implementation specifics to substantiate the claims. revision: yes

Circularity Check

0 steps flagged

No circularity: method claims rest on explicit reformulations without reduction to inputs


The provided abstract and description contain no equations, fitted parameters renamed as predictions, or self-citations that bear the central load. RAVEN's repacking of self-rollouts and CM-GRPO's reformulation of consistency sampling as a conditional Gaussian are presented as design choices whose alignment benefits are asserted via experiment, not derived by construction from the inputs themselves. No step reduces a claimed result to a tautology or prior self-work that is itself unverified. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no free parameters, axioms, or invented entities are described or can be extracted.



Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. MoVerse: Real-Time Video World Modeling with Panoramic Gaussian Scaffold

    cs.CV 2026-06 unverdicted novelty 6.0

    MoVerse generates real-time interactive video world models from single narrow-FOV images via panoramic diffusion expansion, Gaussian scaffold lifting, and distillation of a bidirectional diffusion teacher into a causa...

Reference graph

Works this paper leans on

131 extracted references · 69 canonical work pages · cited by 1 Pith paper · 26 internal anchors

  1. [1]

    Kandinsky 5.0: A Family of Foundation Models for Image and Video Generation

    Vladimir Arkhipkin, Vladimir Korviakov, Nikolai Gerasimenko, Denis Parkhomenko, Viacheslav Vasilev, Alexey Letunovskiy, Nikolai Vaulin, Maria Kovaleva, Ivan Kirillov, Lev Novitskiy, Denis Koposov, Nikita Kiselev, et al. Kandinsky 5.0: A family of foundation models for image and video generation. arXiv preprint arXiv:2511.14993, 2025

  2. [2]

    Sana-video: Efficient video generation with block linear diffusion transformer

    Junsong Chen, Yuyang Zhao, Jincheng Yu, Ruihang Chu, Junyu Chen, Shuai Yang, Xianbang Wang, Yicheng Pan, Daquan Zhou, Huan Ling, Haozhe Liu, Hongwei Yi, et al. Sana-video: Efficient video generation with block linear diffusion transformer. In ICLR, 2026

  3. [3]

    Seedance 1.0: Exploring the Boundaries of Video Generation Models

    Yu Gao, Haoyuan Guo, Tuyen Hoang, Weilin Huang, Lu Jiang, Fangyuan Kong, Huixia Li, Jiashi Li, Liang Li, Xiaojie Li, Xunsong Li, Yifu Li, et al. Seedance 1.0: Exploring the boundaries of video generation models. arXiv preprint arXiv:2506.09113, 2025

  4. [4]

    Mochi 1: Ai video generator

    Genmo. Mochi 1: Ai video generator. https://www.genmo.ai/blog/mochi-1-a-new-sota-in-open-text-to-video, 2024

  5. [5]

    LTX-Video: Realtime Video Latent Diffusion

    Yoav HaCohen, Nisan Chiprut, Benny Brazowski, Daniel Shalem, Dudu Moshe, Eitan Richardson, Eran Levin, Guy Shiran, Nir Zabari, Ori Gordon, Poriya Panet, Sapir Weissbuch, et al. Ltx-video: Realtime video latent diffusion. arXiv preprint arXiv:2501.00103, 2024

  6. [6]

    LTX-2: Efficient Joint Audio-Visual Foundation Model

    Yoav HaCohen, Benny Brazowski, Nisan Chiprut, Yaki Bitterman, Andrew Kvochko, Avishai Berkowitz, Daniel Shalem, Daphna Lifschitz, Dudu Moshe, Eitan Porat, Eitan Richardson, Guy Shiran, et al. Ltx-2: Efficient joint audio-visual foundation model. arXiv preprint arXiv:2601.03233, 2026

  7. [7]

    HunyuanVideo: A Systematic Framework For Large Video Generative Models

    Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, Kathrina Wu, Qin Lin, et al. Hunyuanvideo: A systematic framework for large video generative models. arXiv preprint arXiv:2412.03603, 2025

  8. [8]

    Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model

    Guoqing Ma, Haoyang Huang, Kun Yan, Liangyu Chen, Nan Duan, Shengming Yin, Changyi Wan, Ranchen Ming, Xiaoniu Song, Xing Chen, Yu Zhou, Deshan Sun, et al. Step-video-t2v technical report: The practice, challenges, and future of video foundation model. arXiv preprint arXiv:2502.10248, 2025

  9. [9]

    Movie gen: A cast of media foundation models

    Meta. Movie gen: A cast of media foundation models. https://ai.meta.com/static-resource/movie-gen-research-paper, 2024

  10. [10]

    Cosmos world foundation model platform for physical ai

    NVIDIA. Cosmos world foundation model platform for physical ai. https://research.nvidia.com/publication/2025-01_cosmos-world-foundation-model-platform-physical-ai, 2025

  11. [11]

    Seedance 1.5 pro: A Native Audio-Visual Joint Generation Foundation Model

    Team Seedance, Heyi Chen, Siyan Chen, Xin Chen, Yanfei Chen, Ying Chen, Zhuo Chen, Feng Cheng, Tianheng Cheng, Xinqi Cheng, Xuyan Chi, Jian Cong, et al. Seedance 1.5 pro: A native audio-visual joint generation foundation model. arXiv preprint arXiv:2512.13507, 2025

  12. [12]

    Seedance 2.0: Advancing Video Generation for World Complexity

    Team Seedance, De Chen, Liyang Chen, Xin Chen, Ying Chen, Zhuo Chen, Zhuowei Chen, Feng Cheng, Tianheng Cheng, Yufeng Cheng, Mojie Chi, Xuyan Chi, et al. Seedance 2.0: Advancing video generation for world complexity. arXiv preprint arXiv:2604.14148, 2026

  13. [13]

    Wan: Open and Advanced Large-Scale Video Generative Models

    Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, Jianyuan Zeng, Jiayu Wang, et al. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314, 2025

  14. [14]

    HunyuanVideo 1.5 Technical Report

    Bing Wu, Chang Zou, Changlin Li, Duojun Huang, Fang Yang, Hao Tan, Jack Peng, Jianbing Wu, Jiangfeng Xiong, Jie Jiang, Linus, Patrol, et al. Hunyuanvideo 1.5 technical report. arXiv preprint arXiv:2511.18870, 2025

  15. [15]

    Cogvideox: Text-to-video diffusion models with an expert transformer

    Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, Da Yin, Xiaotao Gu, et al. Cogvideox: Text-to-video diffusion models with an expert transformer. In ICLR, 2025

  16. [16]

    Sand ai, Hansi Teng, Hongyu Jia, Lei Sun, Lingzhi Li, Maolin Li, Mingqiu Tang, Shuai Han, Tianning Zhang, W. Q. Zhang, Weifeng Luo, Xiaoyang Kang, et al. Magi-1: Autoregressive video generation at scale. arXiv preprint arXiv:2505.13211, 2025

  17. [17]

    Causality in video diffusers is separable from denoising

    Xingjian Bai, Guande He, Zhengqi Li, Eli Shechtman, Xun Huang, and Zongze Wu. Causality in video diffusers is separable from denoising. arXiv preprint arXiv:2602.10095, 2026

  18. [18]

    SkyReels-V2: Infinite-length Film Generative Model

    Guibin Chen, Dixuan Lin, Jiangping Yang, Chunze Lin, Junchen Zhu, Mingyuan Fan, Hao Zhang, Sheng Chen, Zheng Chen, Chengcheng Ma, Weiming Xiong, Wei Wang, et al. Skyreels-v2: Infinite-length film generative model. arXiv preprint arXiv:2504.13074, 2025

  19. [19]

    Autoregressive video generation without vector quantization

    Haoge Deng, Ting Pan, Haiwen Diao, Zhengxiong Luo, Yufeng Cui, Huchuan Lu, Shiguang Shan, Yonggang Qi, and Xinlong Wang. Autoregressive video generation without vector quantization. In ICLR, 2025

  20. [20]

    End-to-end training for autoregressive video diffusion via self-resampling

    Yuwei Guo, Ceyuan Yang, Hao He, Yang Zhao, Meng Wei, Zhenheng Yang, Weilin Huang, and Dahua Lin. End-to-end training for autoregressive video diffusion via self-resampling. arXiv preprint arXiv:2512.15702, 2025

  21. [21]

    Live: Long-horizon interactive video world modeling

    Junchao Huang, Ziyang Ye, Xinting Hu, Tianyu He, Guiyu Zhang, Shaoshuai Shi, Jiang Bian, and Li Jiang. Live: Long-horizon interactive video world modeling. arXiv preprint arXiv:2602.03747, 2026

  22. [22]

    Pyramidal flow matching for efficient video generative modeling

    Yang Jin, Zhicheng Sun, Ningyuan Li, Kun Xu, Kun Xu, Hao Jiang, Nan Zhuang, Quzhe Huang, Yang Song, Yadong Mu, and Zhouchen Lin. Pyramidal flow matching for efficient video generative modeling. In ICLR, 2025

  23. [23]

    Stable video infinity: Infinite-length video generation with error recycling

    Wuyang Li, Wentao Pan, Po-Chien Luan, Yang Gao, and Alexandre Alahi. Stable video infinity: Infinite-length video generation with error recycling. In ICLR, 2026

  24. [24]

    Infinitystar: Unified spacetime autoregressive modeling for visual generation

    Jinlai Liu, Jian Han, Bin Yan, Hui Wu, Fengda Zhu, Xing Wang, Yi Jiang, Bingyue Peng, and Zehuan Yuan. Infinitystar: Unified spacetime autoregressive modeling for visual generation. In NeurIPS, 2025

  25. [25]

    Bagger: Backwards aggregation for mitigating drift in autoregressive video diffusion models

    Ryan Po, Eric Ryan Chan, Changan Chen, and Gordon Wetzstein. Bagger: Backwards aggregation for mitigating drift in autoregressive video diffusion models. arXiv preprint arXiv:2512.12080, 2025

  26. [26]

    Pack and force your memory: Long-form and consistent video generation

    Xiaofei Wu, Guozhen Zhang, Zhiyong Xu, Yuan Zhou, Qinglin Lu, and Xuming He. Pack and force your memory: Long-form and consistent video generation. arXiv preprint arXiv:2510.01784, 2025

  27. [27]

    Macro-from-micro planning for high-quality and parallelized autoregressive long video generation

    Xunzhi Xiang, Yabo Chen, Guiyu Zhang, Zhongyu Wang, Zhe Gao, Quanming Xiang, Gonghu Shang, Junqi Liu, Haibin Huang, Yang Gao, Chi Zhang, Qi Fan, et al. Macro-from-micro planning for high-quality and parallelized autoregressive long video generation. arXiv preprint arXiv:2508.03334, 2025

  28. [28]

    Helios: Real real-time long video generation model

    Shenghai Yuan, Yuanyang Yin, Zongjian Li, Xinwei Huang, Xiao Yang, and Li Yuan. Helios: Real real-time long video generation model. arXiv preprint arXiv:2603.04379, 2026

  29. [29]

    BIFE: Better Interaction, Fewer Errors for Minute-Long Video Generation

    Zeyu Zhang, Shuning Chang, Yuanyu He, Yizeng Han, Jiasheng Tang, Fan Wang, and Bohan Zhuang. Blockvid: Block diffusion for high-quality and consistent minute-long video generation. arXiv preprint arXiv:2511.22973, 2025

  30. [30]

    TinyHistory: Lightweight Video History Embeddings via Two-Stage Context Learning

    Lvmin Zhang, Shengqu Cai, Muyang Li, Chong Zeng, Beijia Lu, Anyi Rao, Song Han, Gordon Wetzstein, and Maneesh Agrawala. Pretraining frame preservation for lightweight autoregressive video history embedding. arXiv preprint arXiv:2512.23851, 2026

  31. [31]

    Self forcing: Bridging the train-test gap in autoregressive video diffusion

    Xun Huang, Zhengqi Li, Guande He, Mingyuan Zhou, and Eli Shechtman. Self forcing: Bridging the train-test gap in autoregressive video diffusion. In NeurIPS, 2025

  32. [32]

    Rolling forcing: Autoregressive long video diffusion in real time

    Kunhao Liu, Wenbo Hu, Jiale Xu, Ying Shan, and Shijian Lu. Rolling forcing: Autoregressive long video diffusion in real time. In ICLR, 2026

  33. [33]

    Reward forcing: Efficient streaming video generation with rewarded distribution matching distillation

    Yunhong Lu, Yanhong Zeng, Haobo Li, Hao Ouyang, Qiuyu Wang, Ka Leong Cheng, Jiapeng Zhu, Hengyuan Cao, Zhipeng Zhang, Xing Zhu, Yujun Shen, and Min Zhang. Reward forcing: Efficient streaming video generation with rewarded distribution matching distillation. In CVPR, 2026

  34. [34]

    Longlive: Real-time interactive long video generation

    Shuai Yang, Wei Huang, Ruihang Chu, Yicheng Xiao, Yuyang Zhao, Xianbang Wang, Muyang Li, Enze Xie, Yingcong Chen, Yao Lu, Song Han, and Yukang Chen. Longlive: Real-time interactive long video generation. In ICLR, 2026

  35. [35]

    From slow bidirectional to fast autoregressive video diffusion models

    Tianwei Yin, Qiang Zhang, Richard Zhang, William T. Freeman, Fredo Durand, Eli Shechtman, and Xun Huang. From slow bidirectional to fast autoregressive video diffusion models. In CVPR, 2025

  36. [36]

    Causal forcing: Autoregressive diffusion distillation done right for high-quality real-time interactive video generation

    Hongzhou Zhu, Min Zhao, Guande He, Hang Su, Chongxuan Li, and Jun Zhu. Causal forcing: Autoregressive diffusion distillation done right for high-quality real-time interactive video generation. In ICML, 2026

  37. [37]

    Diffusion forcing: Next-token prediction meets full-sequence diffusion

    Boyuan Chen, Diego Marti Monso, Yilun Du, Max Simchowitz, Russ Tedrake, and Vincent Sitzmann. Diffusion forcing: Next-token prediction meets full-sequence diffusion. In NeurIPS, 2024

  38. [38]

    History-guided video diffusion

    Kiwhan Song, Boyuan Chen, Max Simchowitz, Yilun Du, Russ Tedrake, and Vincent Sitzmann. History-guided video diffusion. In ICML, 2025

  39. [39]

    One-step diffusion with distribution matching distillation

    Tianwei Yin, Michaël Gharbi, Richard Zhang, Eli Shechtman, Frédo Durand, William T. Freeman, and Taesung Park. One-step diffusion with distribution matching distillation. In CVPR, pages 6613–6623, 2024

  40. [40]

    Tianwei Yin, Michaël Gharbi, Taesung Park, Richard Zhang, Eli Shechtman, Fredo Durand, and William T. Freeman. Improved distribution matching distillation for fast image synthesis. In NeurIPS, 2024

  41. [41]

    Flow-grpo: Training flow matching models via online rl

    Jie Liu, Gongye Liu, Jiajun Liang, Yangguang Li, Jiaheng Liu, Xintao Wang, Pengfei Wan, Di Zhang, and Wanli Ouyang. Flow-grpo: Training flow matching models via online rl. In NeurIPS, 2025

  42. [42]

    Context forcing: Consistent autoregressive video generation with long context

    Shuo Chen, Cong Wei, Sun Sun, Ping Nie, Kai Zhou, Ge Zhang, Ming-Hsuan Yang, and Wenhu Chen. Context forcing: Consistent autoregressive video generation with long context. arXiv preprint arXiv:2602.06028, 2026

  43. [43]

    Self-Forcing++: Towards Minute-Scale High-Quality Video Generation

    Justin Cui, Jie Wu, Ming Li, Tao Yang, Xiaojie Li, Rui Wang, Andrew Bai, Yuanhao Ban, and Cho-Jui Hsieh. Self-forcing++: Towards minute-scale high-quality video generation. arXiv preprint arXiv:2510.02283, 2025

  44. [44]

    Streaming autoregressive video generation via diagonal distillation

    Jinxiu Liu, Xuanming Liu, Kangfu Mei, Yandong Wen, Ming-Hsuan Yang, and Weiyang Liu. Streaming autoregressive video generation via diagonal distillation. In ICLR, 2026

  45. [45]

    Hiar: Efficient autoregressive long video generation via hierarchical denoising

    Kai Zou, Dian Zheng, Hongbo Liu, Tiankai Hang, Bin Liu, and Nenghai Yu. Hiar: Efficient autoregressive long video generation via hierarchical denoising. arXiv preprint arXiv:2603.08703, 2026

  46. [46]

    Past- and future-informed kv cache policy with salience estimation in autoregressive video diffusion

    Hanmo Chen, Chenghao Xu, Xu Yang, Xuan Chen, and Cheng Deng. Past- and future-informed kv cache policy with salience estimation in autoregressive video diffusion. arXiv preprint arXiv:2601.21896, 2026

  47. [47]

    Memflow: Flowing adaptive memory for consistent and efficient long video narratives

    Sihui Ji, Xi Chen, Shuai Yang, Xin Tao, Pengfei Wan, and Hengshuang Zhao. Memflow: Flowing adaptive memory for consistent and efficient long video narratives. arXiv preprint arXiv:2512.14699, 2025

  48. [48]

    Motionstream: Real-time video generation with interactive motion controls

    Joonghyuk Shin, Zhengqi Li, Richard Zhang, Jun-Yan Zhu, Jaesik Park, Eli Shechtman, and Xun Huang. Motionstream: Real-time video generation with interactive motion controls. In ICLR, 2026

  49. [49]

    Videossm: Autoregressive long video generation with hybrid state-space memory

    Yifei Yu, Xiaoshan Wu, Xinting Hu, Tao Hu, Yangtian Sun, Xiaoyang Lyu, Bo Wang, Lin Ma, Yuewen Ma, Zhongrui Wang, and Xiaojuan Qi. Videossm: Autoregressive long video generation with hybrid state-space memory. arXiv preprint arXiv:2512.04519, 2025

  50. [50]

    Memorize-and-generate: Towards long-term consistency in real-time video generation

    Tianrui Zhu, Shiyi Zhang, Zhirui Sun, Jingqi Tian, and Yansong Tang. Memorize-and-generate: Towards long-term consistency in real-time video generation. arXiv preprint arXiv:2512.18741, 2025

  51. [51]

    Lol: Longer than longer, scaling video generation to hour

    Justin Cui, Jie Wu, Ming Li, Tao Yang, Xiaojie Li, Rui Wang, Andrew Bai, Yuanhao Ban, and Cho-Jui Hsieh. Lol: Longer than longer, scaling video generation to hour. arXiv preprint arXiv:2601.16914, 2026

  52. [52]

    Train short, inference long: Training-free horizon extension for autoregressive video generation

    Jia Li, Xiaomeng Fu, Xurui Peng, Weifeng Chen, Youwei Zheng, Tianyu Zhao, Jiexi Wang, Fangmin Chen, Xing Wang, and Hayden Kwok-Hay So. Train short, inference long: Training-free horizon extension for autoregressive video generation. arXiv preprint arXiv:2602.14027, 2026

  53. [53]

    Pathwise test-time correction for autoregressive long video generation

    Xunzhi Xiang, Zixuan Duan, Guiyu Zhang, Haiyu Zhang, Zhe Gao, Junta Wu, Shaofeng Zhang, Tengfei Wang, Qi Fan, and Chunchao Guo. Pathwise test-time correction for autoregressive long video generation. arXiv preprint arXiv:2602.05871, 2026

  54. [54]

    Infinity-rope: Action-controllable infinite video generation emerges from autoregressive self-rollout

    Hidir Yesiltepe, Tuna Han Salih Meral, Adil Kaan Akan, Kaan Oktay, and Pinar Yanardag. Infinity-rope: Action-controllable infinite video generation emerges from autoregressive self-rollout. In CVPR, 2026

  55. [55]

    Deep forcing: Training-free long video generation with deep sink and participative compression

    Jung Yi, Wooseok Jang, Paul Hyunbin Cho, Jisu Nam, Heeji Yoon, and Seungryong Kim. Deep forcing: Training-free long video generation with deep sink and participative compression. arXiv preprint arXiv:2512.05081, 2025

  56. [56]

    Relax forcing: Relaxed kv-memory for consistent long video generation

    Zengqun Zhao, Yanzuo Lu, Ziquan Liu, Jifei Song, Jiankang Deng, and Ioannis Patras. Relax forcing: Relaxed kv-memory for consistent long video generation. arXiv preprint arXiv:2603.21366, 2026

  57. [57]

    Training diffusion models with reinforcement learning

    Kevin Black, Michael Janner, Yilun Du, Ilya Kostrikov, and Sergey Levine. Training diffusion models with reinforcement learning. In ICLR, 2024

  58. [58]

    Pref-GRPO: Pairwise Preference Reward-based GRPO for Stable Text-to-Image Reinforcement Learning

    Yibin Wang, Zhimin Li, Yuhang Zang, Yujie Zhou, Jiazi Bu, Chunyu Wang, Qinglin Lu, Cheng Jin, and Jiaqi Wang. Pref-grpo: Pairwise preference reward-based grpo for stable text-to-image reinforcement learning. arXiv preprint arXiv:2508.20751, 2026

  59. [59]

    Imagereward: Learning and evaluating human preferences for text-to-image generation

    Jiazheng Xu, Xiao Liu, Yuchen Wu, Yuxuan Tong, Qinkai Li, Ming Ding, Jie Tang, and Yuxiao Dong. Imagereward: Learning and evaluating human preferences for text-to-image generation. In NeurIPS, 2023

  60. [60]

    Diffusionnft: Online diffusion reinforcement with forward process

    Kaiwen Zheng, Huayu Chen, Haotian Ye, Haoxiang Wang, Qinsheng Zhang, Kai Jiang, Hang Su, Stefano Ermon, Jun Zhu, and Ming-Yu Liu. Diffusionnft: Online diffusion reinforcement with forward process. In ICLR, 2026

  61. [61]

    Stage: Stable and generalizable grpo for autoregressive image generation. arXiv preprint arXiv:2509.25027, 2025

    Xiaoxiao Ma, Haibo Qiu, Guohui Zhang, Zhixiong Zeng, Siqi Yang, Lin Ma, and Feng Zhao. Stage: Stable and generalizable grpo for autoregressive image generation. arXiv preprint arXiv:2509.25027, 2025

  62. [62]

    Worldcompass: Reinforcement learning for long-horizon world models

    Zehan Wang, Tengfei Wang, Haiyu Zhang, Xuhui Zuo, Junta Wu, Haoyuan Wang, Wenqiang Sun, Zhenwei Wang, Chenjie Cao, Hengshuang Zhao, Chunchao Guo, and Zhou Zhao. Worldcompass: Reinforcement learning for long-horizon world models. arXiv preprint arXiv:2602.09022, 2026

  63. [63]

    Rlvr-world: Training world models with reinforcement learning

    Jialong Wu, Shaofeng Yin, Ningya Feng, and Mingsheng Long. Rlvr-world: Training world models with reinforcement learning. In ICLR, 2025

  64. [64]

    Reinforcement learning with inverse rewards for world model post-training. arXiv preprint arXiv:2509.23958, 2025

    Yang Ye, Tianyu He, Shuo Yang, and Jiang Bian. Reinforcement learning with inverse rewards for world model post-training. arXiv preprint arXiv:2509.23958, 2025

  65. [65]

    Astrolabe: Steering forward-process reinforcement learning for distilled autoregressive video models. arXiv preprint arXiv:2603.17051, 2026

    Songchun Zhang, Zeyue Xue, Siming Fu, Jie Huang, Xianghao Kong, Y. Ma, Haoyang Huang, Nan Duan, and Anyi Rao. Astrolabe: Steering forward-process reinforcement learning for distilled autoregressive video models. arXiv preprint arXiv:2603.17051, 2026

  66. [66]

    Real-time motion-controllable autoregressive video diffusion

    Kesen Zhao, Jiaxin Shi, Beier Zhu, Junbao Zhou, Xiaolong Shen, Yuan Zhou, Qianru Sun, and Hanwang Zhang. Real-time motion-controllable autoregressive video diffusion. In ICLR, 2026

  67. [67]

    Flash-dmd: Towards high-fidelity few-step image generation with efficient distillation and joint reinforcement learning. arXiv preprint arXiv:2511.20549, 2025

    Guanjie Chen, Shirui Huang, Kai Liu, Jianchen Zhu, Xiaoye Qu, Peng Chen, Yu Cheng, and Yifu Sun. Flash-dmd: Towards high-fidelity few-step image generation with efficient distillation and joint reinforcement learning. arXiv preprint arXiv:2511.20549, 2025

  68. [68]

    Erudiff: Refactoring knowledge in diffusion models for advanced text-to-image synthesis. arXiv preprint arXiv:2603.20828, 2026

    Xiefan Guo, Xinzhu Ma, Haoxiang Ma, Zihao Zhou, and Di Huang. Erudiff: Refactoring knowledge in diffusion models for advanced text-to-image synthesis. arXiv preprint arXiv:2603.20828, 2026

  69. [69]

    Tdm-r1: Reinforcing few-step diffusion models with non-differentiable reward. arXiv preprint arXiv:2603.07700, 2026

    Yihong Luo, Tianyang Hu, Weijian Luo, and Jing Tang. Tdm-r1: Reinforcing few-step diffusion models with non-differentiable reward. arXiv preprint arXiv:2603.07700, 2026

  70. [70]

    Gardo: Reinforcing diffusion models without reward hacking. arXiv preprint arXiv:2512.24138, 2025

    Haoran He, Yuxiao Ye, Jie Liu, Jiajun Liang, Zhiyong Wang, Ziyang Yuan, Xintao Wang, Hangyu Mao, Pengfei Wan, and Ling Pan. Gardo: Reinforcing diffusion models without reward hacking. arXiv preprint arXiv:2512.24138, 2025

  71. [71]

    Unigrpo: Unified policy optimization for reasoning-driven visual generation. arXiv preprint arXiv:2603.23500, 2026

    Jie Liu, Zilyu Ye, Linxiao Yuan, Shenhan Zhu, Yu Gao, Jie Wu, Kunchang Li, Xionghui Wang, Xiaonan Nie, Weilin Huang, and Wanli Ouyang. Unigrpo: Unified policy optimization for reasoning-driven visual generation. arXiv preprint arXiv:2603.23500, 2026

  72. [72]

    Data-regularized reinforcement learning for diffusion models at scale. arXiv preprint arXiv:2512.04332, 2025

    Haotian Ye, Kaiwen Zheng, Jiashu Xu, Puheng Li, Huayu Chen, Jiaqi Han, Sheng Liu, Qinsheng Zhang, Hanzi Mao, Zekun Hao, Prithvijit Chattopadhyay, Dinghao Yang, et al. Data-regularized reinforcement learning for diffusion models at scale. arXiv preprint arXiv:2512.04332, 2025

  73. [73]

    Diffusion reinforcement learning via centered reward distillation. arXiv preprint arXiv:2603.14128, 2026

    Yuanzhi Zhu, Xi Wang, Stéphane Lathuilière, and Vicky Kalogeiton. Diffusion reinforcement learning via centered reward distillation. arXiv preprint arXiv:2603.14128, 2026

  74. [74]

    Neighbor grpo: Contrastive ode policy optimization aligns flow models

    Dailan He, Guanlin Feng, Xingtong Ge, Yazhe Niu, Yi Zhang, Bingqi Ma, Guanglu Song, Yu Liu, and Hongsheng Li. Neighbor grpo: Contrastive ode policy optimization aligns flow models. In CVPR, 2026

  75. [75]

    Reinforcing diffusion models by direct group preference optimization

    Yihong Luo, Tianyang Hu, and Jing Tang. Reinforcing diffusion models by direct group preference optimization. In ICLR, 2026

  76. [76]

    Understanding sampler stochasticity in training diffusion models for rlhf. arXiv preprint arXiv:2510.10767, 2025

    Jiayuan Sheng, Hanyang Zhao, Haoxian Chen, David D. Yao, and Wenpin Tang. Understanding sampler stochasticity in training diffusion models for rlhf. arXiv preprint arXiv:2510.10767, 2025

  77. [77]

    Coefficients-preserving sampling for reinforcement learning with flow matching. arXiv preprint arXiv:2509.05952, 2025

    Feng Wang and Zihao Yu. Coefficients-preserving sampling for reinforcement learning with flow matching. arXiv preprint arXiv:2509.05952, 2025

  78. [78]

    Pc-flow: Preference alignment in flow matching via classifier

    Shaomeng Wang, He Wang, Longquan Dai, and Jinhui Tang. Pc-flow: Preference alignment in flow matching via classifier. In AAAI, 2026

  79. [79]

    E-grpo: High entropy steps drive effective reinforcement learning for flow models

    Shengjun Zhang, Zhang Zhang, Chensheng Dai, and Yueqi Duan. E-grpo: High entropy steps drive effective reinforcement learning for flow models. In CVPR Findings, 2026

  80. [80]

    Manifold-aware exploration for reinforcement learning in video generation. arXiv preprint arXiv:2603.21872, 2026

    Mingzhe Zheng, Weijie Kong, Yue Wu, Dengyang Jiang, Yue Ma, Xuanhua He, Bin Lin, Kaixiong Gong, Zhao Zhong, Liefeng Bo, Qifeng Chen, and Harry Yang. Manifold-aware exploration for reinforcement learning in video generation. arXiv preprint arXiv:2603.21872, 2026
