RAVEN: Real-time Autoregressive Video Extrapolation with Consistency-model GRPO
Pith reviewed 2026-06-30 20:35 UTC · model grok-4.3
The pith
RAVEN aligns training attention with inference-time extrapolation in causal video diffusion by repacking self-rollouts into interleaved clean and noisy sequences.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
RAVEN is a training-time test framework that repacks each self-rollout into an interleaved sequence of clean historical endpoints and noisy denoising states. This formulation aligns training attention with inference-time extrapolation and lets downstream chunk losses supervise the history representations on which future predictions depend. CM-GRPO reformulates a consistency sampling step as a conditional Gaussian transition and applies online RL directly to this kernel.
What carries the argument
Repacking of self-rollouts into interleaved sequences of clean historical endpoints and noisy denoising states, which aligns training attention patterns with those at inference.
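The repacking can be pictured with a small sketch. It rests on assumptions that the abstract does not confirm: each chunk contributes one noisy denoising state followed by its clean endpoint, and a token may attend only to clean endpoints of strictly earlier chunks and to itself. The names `pack_interleaved`, `can_attend`, and `build_mask` are hypothetical, and the paper's actual layout and masking rule may differ.

```python
from typing import List, Tuple

# A token is labelled by its kind ("noisy" or "clean") and its chunk index.
Token = Tuple[str, int]


def pack_interleaved(num_chunks: int) -> List[Token]:
    """Interleave each chunk's noisy state with its clean endpoint.

    Assumed layout (not taken from the paper):
    noisy_0, clean_0, noisy_1, clean_1, ...
    """
    packed: List[Token] = []
    for i in range(num_chunks):
        packed.append(("noisy", i))
        packed.append(("clean", i))
    return packed


def can_attend(query: Token, key: Token) -> bool:
    """Assumed inference-like rule: a token sees clean history from
    strictly earlier chunks, plus itself. Noisy states of other chunks
    are never visible, mirroring a cache that stores only clean endpoints."""
    _, q_idx = query
    k_kind, k_idx = key
    if k_kind == "clean" and k_idx < q_idx:
        return True
    return query == key


def build_mask(packed: List[Token]) -> List[List[bool]]:
    """Boolean attention mask: mask[q][k] is True when q may attend to k."""
    return [[can_attend(q, k) for k in packed] for q in packed]


if __name__ == "__main__":
    packed = pack_interleaved(3)
    for token, row in zip(packed, build_mask(packed)):
        print(token, ["x" if allowed else "." for allowed in row])
```

With three chunks, the row for the third noisy state is open only for the two earlier clean endpoints and for itself, which is the pattern an autoregressive sampler with a clean-history cache would see under these assumptions.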
If this is right
- RAVEN surpasses recent causal video distillation baselines across quality, semantic, and dynamic degree evaluations.
- CM-GRPO provides further performance gains when combined with RAVEN.
- The method enables higher-quality real-time streaming video generation by extrapolating future chunks from generated history.
- Chunk losses can now supervise history representations that future predictions depend on.
Where Pith is reading between the lines
- The interleaving idea could be tested in autoregressive generation tasks outside video, such as audio or text sequences, to reduce similar training-inference mismatches.
- The framework might scale to longer video horizons where distribution shift typically grows most severe.
- Direct RL on the consistency kernel might combine with other sampling accelerations beyond the reported experiments.
Load-bearing premise
Repacking self-rollouts into interleaved clean and noisy sequences aligns training attention with inference-time extrapolation without introducing new distribution shifts or training instabilities that offset the gains.
What would settle it
An ablation that removes or randomizes the interleaving step during training and measures whether the reported gains in long-horizon quality, semantic consistency, and dynamic degree disappear.
Original abstract
Causal autoregressive video diffusion models support real-time streaming generation by extrapolating future chunks from previously generated content. Distilling such generators from high-fidelity bidirectional teachers yields competitive few-step models, yet a persistent gap between the history distributions encountered during training and those arising at inference constrains generation quality over long horizons. We introduce the Real-time Autoregressive Video Extrapolation Network (RAVEN), a training-time test framework that repacks each self rollout into an interleaved sequence of clean historical endpoints and noisy denoising states. This formulation aligns training attention with inference-time extrapolation and allows downstream chunk losses to supervise the history representations on which future predictions depend. We further propose Consistency-model Group Relative Policy Optimization (CM-GRPO), which reformulates a consistency sampling step as a conditional Gaussian transition and applies online Reinforcement Learning (RL) directly to this kernel, avoiding the Euler-Maruyama auxiliary process adopted in prior flow-model RL formulations. Experiments demonstrate that RAVEN surpasses recent causal video distillation baselines across quality, semantic, and dynamic degree evaluations, and that CM-GRPO provides further gains when combined with RAVEN.
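The phrase "conditional Gaussian transition" can be made concrete with the standard form of a multistep consistency sampling step, in which the network predicts a clean sample and fresh noise is added to reach the next noise level. The block below is a sketch under assumed notation, not the paper's own derivation: $f_\theta$ is the consistency model, $h$ the clean history, $t > s$ two noise levels, and $\alpha_s, \sigma_s$ the schedule coefficients at level $s$. None of these symbols appear in the source.

```latex
% One consistency sampling step: predict the clean chunk, then re-noise to level s.
x_s = \alpha_s\, f_\theta(x_t, t, h) + \sigma_s\, \epsilon,
\qquad \epsilon \sim \mathcal{N}(0, I)

% Read as a policy, the step is a Gaussian kernel with a learned mean.
\pi_\theta(x_s \mid x_t, h)
  = \mathcal{N}\!\left(x_s;\; \alpha_s f_\theta(x_t, t, h),\; \sigma_s^2 I\right)

% Its log-density is available in closed form, up to a constant C
% that does not depend on \theta.
\log \pi_\theta(x_s \mid x_t, h)
  = -\frac{\lVert x_s - \alpha_s f_\theta(x_t, t, h) \rVert^2}{2\sigma_s^2} + C
```

Because the noise injection is already part of the sampler, a kernel of this form would have a tractable likelihood without converting a deterministic ODE step into an SDE, which is consistent with the abstract's remark about avoiding the Euler-Maruyama auxiliary process.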
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims to introduce the Real-time Autoregressive Video Extrapolation Network (RAVEN), a training-time test framework that repacks each self-rollout into an interleaved sequence of clean historical endpoints and noisy denoising states to align training attention with inference-time extrapolation in causal autoregressive video diffusion models. It further proposes Consistency-model Group Relative Policy Optimization (CM-GRPO), which reformulates a consistency sampling step as a conditional Gaussian transition and applies online RL directly to this kernel, avoiding the Euler-Maruyama auxiliary process. Experiments are asserted to show that RAVEN surpasses recent causal video distillation baselines across quality, semantic, and dynamic degree evaluations, with additional gains when combined with CM-GRPO.
Significance. If validated, the work could advance real-time streaming video generation by mitigating history distribution gaps that limit long-horizon quality in autoregressive models. The CM-GRPO reformulation offers a direct RL approach on consistency kernels, which is a technical distinction from prior flow-model RL methods. No machine-checked proofs or reproducible code are mentioned, but the focus on aligning training and inference distributions is a relevant direction if the alignment is rigorously shown.
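For readers unfamiliar with GRPO-style updates, the following sketch shows the generic ingredients that an RL method on a Gaussian kernel would need: a closed-form log-probability, group-relative advantages, and a PPO-style clipped surrogate. It illustrates standard GRPO, not the paper's CM-GRPO objective, whose details are not given in the source. All function names and the default `clip` value are assumptions.

```python
import math
from statistics import mean, pstdev
from typing import List, Sequence


def gaussian_log_prob(x: Sequence[float], mu: Sequence[float], sigma: float) -> float:
    """Log-density of an isotropic Gaussian N(mu, sigma^2 I) evaluated at x."""
    d = len(x)
    sq_dist = sum((xi - mi) ** 2 for xi, mi in zip(x, mu))
    return -sq_dist / (2 * sigma**2) - d * math.log(sigma) - 0.5 * d * math.log(2 * math.pi)


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Normalise rewards within one group of rollouts for the same prompt."""
    m = mean(rewards)
    s = pstdev(rewards)
    return [(r - m) / (s + eps) for r in rewards]


def clipped_surrogate(logp_new: float, logp_old: float, advantage: float,
                      clip: float = 0.2) -> float:
    """PPO-style clipped objective for a single transition (to be maximised)."""
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = min(max(ratio, 1 - clip), 1 + clip)
    return min(ratio * advantage, clipped_ratio * advantage)


if __name__ == "__main__":
    # Hypothetical rewards for three rollouts sampled from the same prompt.
    advantages = group_relative_advantages([1.0, 2.0, 3.0])
    # Hypothetical transition: same sample scored under old and new means.
    logp_old = gaussian_log_prob([0.5, -0.2], [0.4, -0.1], sigma=0.3)
    logp_new = gaussian_log_prob([0.5, -0.2], [0.5, -0.2], sigma=0.3)
    print(advantages, clipped_surrogate(logp_new, logp_old, advantages[-1]))
```

In this framing the only model-dependent quantity is the mean passed to `gaussian_log_prob`, so the policy ratio can be computed directly from the sampler's own transition.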
major comments (2)
- [RAVEN description] The RAVEN formulation asserts that repacking self-rollouts into interleaved sequences of clean historical endpoints and noisy denoising states aligns training attention with inference-time extrapolation and enables chunk losses to supervise history representations, but provides no derivation showing that the resulting joint distribution over (clean, noisy) pairs matches inference trajectories. This is load-bearing for the central claim, as unanalyzed distribution shifts from the interleaving could offset the reported gains rather than achieve the intended alignment.
- [Experiments] The abstract asserts experimental superiority over causal video distillation baselines on quality, semantic, and dynamic degree evaluations but supplies no metrics, dataset details, ablation studies, or implementation specifics. This absence prevents verification of the magnitude or reliability of the claimed improvements and of whether CM-GRPO provides further gains.
minor comments (1)
- The acronym 'CM-GRPO' is expanded as 'Consistency-model Group Relative Policy Optimization' in the text; ensure the title's 'Consistency-model GRPO' is consistent or clarified.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address the major comments point by point below.
- Referee: [RAVEN description] The RAVEN formulation asserts that repacking self-rollouts into interleaved sequences of clean historical endpoints and noisy denoising states aligns training attention with inference-time extrapolation and enables chunk losses to supervise history representations, but provides no derivation showing that the resulting joint distribution over (clean, noisy) pairs matches inference trajectories. This is load-bearing for the central claim, as unanalyzed distribution shifts from the interleaving could offset the reported gains rather than achieve the intended alignment.
  Authors: We agree with the referee that a formal derivation is absent from the current manuscript. The interleaving strategy is motivated by the need to match the attention patterns, but without an explicit proof that the joint distribution is preserved, the alignment claim remains heuristic. We will add a detailed derivation in the revised version demonstrating that the repacked training distribution matches the inference trajectories under the causal autoregressive setting.
  Revision: yes
- Referee: [Experiments] The abstract asserts experimental superiority over causal video distillation baselines on quality, semantic, and dynamic degree evaluations but supplies no metrics, dataset details, ablation studies, or implementation specifics. This absence prevents verification of the magnitude or reliability of the claimed improvements and of whether CM-GRPO provides further gains.
  Authors: The referee is correct that neither the abstract nor the provided manuscript text includes specific metrics, dataset details, or ablations. We will revise the manuscript to include a full Experiments section with quantitative results (e.g., specific scores on quality metrics), details on the datasets used, ablation studies isolating the contributions of RAVEN and CM-GRPO, and implementation specifics to substantiate the claims.
  Revision: yes
Circularity Check
No circularity: method claims rest on explicit reformulations without reduction to inputs
The provided abstract and description contain no equations, fitted parameters renamed as predictions, or self-citations that bear the central load. RAVEN's repacking of self-rollouts and CM-GRPO's reformulation of consistency sampling as a conditional Gaussian are presented as design choices whose alignment benefits are asserted via experiment, not derived by construction from the inputs themselves. No step reduces a claimed result to a tautology or prior self-work that is itself unverified. The derivation chain is therefore self-contained against external benchmarks.
Forward citations
Cited by 1 Pith paper
- MoVerse: Real-Time Video World Modeling with Panoramic Gaussian Scaffold
  MoVerse generates real-time interactive video world models from single narrow-FOV images via panoramic diffusion expansion, Gaussian scaffold lifting, and distillation of a bidirectional diffusion teacher into a causa...