Recognition: no theorem link
Self-Forcing++: Towards Minute-Scale High-Quality Video Generation
Pith reviewed 2026-05-15 22:36 UTC · model grok-4.3
The pith
Self-generated segments from a video model steer it to produce coherent four-minute clips without long-video teachers or retraining.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By drawing guidance segments directly from the student model's own long self-generated videos, the method supplies reliable teacher-like signals that prevent error accumulation in the continuous latent space, thereby enabling high-fidelity video generation up to 4 minutes 15 seconds long without any long-horizon teacher or additional training data.
What carries the argument
The self-forcing++ loop that repeatedly samples short segments from the model's own extended generations and feeds them back as conditioning to maintain consistency across the full sequence.
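To make the loop concrete, here is a minimal sketch of how such self-sampling guidance could look in practice, assuming a student that generates latent chunks autoregressively and a teacher-derived loss defined on short windows; all names (generate_chunk, teacher_guidance_loss, WINDOW, LONG_LEN) are illustrative assumptions rather than the authors' actual interface.

```python
# Minimal sketch of the self-sampling guidance loop described above.
# All names (generate_chunk, teacher_guidance_loss, WINDOW, LONG_LEN) are
# illustrative assumptions, not the authors' actual interface.
import random
import torch

WINDOW = 48     # short segment length the teacher can reliably judge (assumed)
LONG_LEN = 960  # latent frames in one long self-generated rollout (assumed)

def self_guidance_step(student, teacher_guidance_loss, prompt, optimizer):
    # 1) Roll the student out far beyond the teacher's horizon, conditioning
    #    each chunk on its own previous chunks (no overlapping-frame recomputation).
    with torch.no_grad():
        context, chunks = None, []
        while sum(c.shape[1] for c in chunks) < LONG_LEN:
            chunk, context = student.generate_chunk(prompt, context)
            chunks.append(chunk)
        video = torch.cat(chunks, dim=1)  # (batch, frames, ...)

    # 2) Sample a short segment from the long self-generated video.
    start = random.randint(0, video.shape[1] - WINDOW)
    segment = video[:, start:start + WINDOW]

    # 3) Apply a short-horizon, teacher-derived loss on that segment only
    #    (assumed to re-run the student with gradients enabled), so guidance
    #    lands at positions the student actually visits during long rollouts.
    loss = teacher_guidance_loss(student, segment, prompt)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The design point the sketch reflects is that the teacher is only ever asked to judge short windows it can handle, while those windows are drawn from positions deep inside rollouts the student will actually visit.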
If this is right
- Temporal consistency is preserved while video length scales up to twenty times beyond the teacher's horizon.
- Over-exposure and error accumulation are avoided without recomputing overlapping frames.
- Generation reaches 99.9 percent of the base model's position-embedding span.
- Fidelity and consistency scores exceed those of prior baselines on both standard and newly proposed benchmarks.
- No retraining on long-video datasets is needed.
Where Pith is reading between the lines
- The same self-sampling idea could be tested on other autoregressive modalities such as long audio or 3-D scene sequences.
- If self-guidance works here, it may reduce reliance on ever-larger supervised datasets for extending context in generative models.
- Position-embedding limits rather than error accumulation may become the next practical bottleneck once this technique is applied.
Load-bearing premise
Segments taken from the model's own long outputs remain of high enough quality to act as stable guidance without injecting new compounding errors.
What would settle it
Run the method with a five-minute target and measure whether visual fidelity or temporal consistency drops measurably below the quality achieved at four minutes and fifteen seconds.
read the original abstract
Diffusion models have revolutionized image and video generation, achieving unprecedented visual quality. However, their reliance on transformer architectures incurs prohibitively high computational costs, particularly when extending generation to long videos. Recent work has explored autoregressive formulations for long video generation, typically by distilling from short-horizon bidirectional teachers. Nevertheless, given that teacher models cannot synthesize long videos, the extrapolation of student models beyond their training horizon often leads to pronounced quality degradation, arising from the compounding of errors within the continuous latent space. In this paper, we propose a simple yet effective approach to mitigate quality degradation in long-horizon video generation without requiring supervision from long-video teachers or retraining on long video datasets. Our approach centers on exploiting the rich knowledge of teacher models to provide guidance for the student model through sampled segments drawn from self-generated long videos. Our method maintains temporal consistency while scaling video length by up to 20x beyond teacher's capability, avoiding common issues such as over-exposure and error-accumulation without recomputing overlapping frames like previous methods. When scaling up the computation, our method shows the capability of generating videos up to 4 minutes and 15 seconds, equivalent to 99.9% of the maximum span supported by our base model's position embedding and more than 50x longer than that of our baseline model. Experiments on standard benchmarks and our proposed improved benchmark demonstrate that our approach substantially outperforms baseline methods in both fidelity and consistency. Our long-horizon videos demo can be found at https://self-forcing-plus-plus.github.io/
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Self-Forcing++, a method for long-horizon autoregressive video generation that uses sampled segments from the student model's own self-generated videos as guidance signals. This is intended to mitigate compounding errors in the continuous latent space without access to long-video teacher models or retraining on extended datasets, enabling videos up to 4 minutes 15 seconds (99.9% of the base model's position-embedding span and >50x longer than the baseline) while preserving temporal consistency and visual fidelity.
Significance. If the central premise holds, the work would meaningfully advance scalable video synthesis by demonstrating that self-guidance can substitute for unavailable teacher supervision at extreme lengths, potentially reducing data and compute barriers for minute-scale generation. The reported length extension and benchmark gains would represent a practical step beyond current short-horizon distillation limits.
major comments (3)
- [Method and Experiments] The central claim that self-generated segments supply reliable, non-degrading guidance rests on an unverified premise: no quantitative characterization of latent-space error growth (e.g., drift in continuous codes) within the initial long roll-outs is provided, nor is there an ablation isolating segment quality or a comparison against oracle teacher segments. This directly affects the soundness of the 20x scaling and 4:15 capability assertions.
- [Abstract and Experiments] The reported improvements in fidelity and consistency are stated without accompanying quantitative metrics (specific FVD, FID, or temporal-consistency scores with error bars and baseline deltas), segment-selection criteria, or ablation studies showing how self-guidance avoids the error accumulation it claims to solve.
- [Results and Scaling Experiments] The 99.9% position-embedding utilization claim is load-bearing for the headline result, yet the manuscript does not detail the mechanism by which self-guidance prevents degradation at this horizon or provide controls (e.g., generation without self-segments) to isolate its contribution.
minor comments (2)
- [Abstract and Experiments] The abstract references 'our proposed improved benchmark' without a clear description of its construction, differences from standard benchmarks, or evaluation protocol in the main text.
- [Method] Notation for segment sampling and guidance injection could be clarified with a concise algorithm box or pseudocode to improve reproducibility.
Simulated Author's Rebuttal
We thank the referee for their thorough review and constructive comments. We address each of the major comments below and have updated the manuscript to incorporate additional analyses and clarifications as suggested.
read point-by-point responses
-
Referee: [Method and Experiments] The central claim that self-generated segments supply reliable, non-degrading guidance rests on an unverified premise: no quantitative characterization of latent-space error growth (e.g., drift in continuous codes) within the initial long roll-outs is provided, nor is there an ablation isolating segment quality or a comparison against oracle teacher segments. This directly affects the soundness of the 20x scaling and 4:15 capability assertions.
Authors: We agree that a quantitative characterization of latent-space error growth would strengthen the central claim. In the revised manuscript we have added measurements of drift in continuous latent codes across initial long roll-outs. We also include a new ablation that isolates segment quality and compares self-generated segments against oracle teacher segments (where short-horizon teacher outputs can be obtained). These additions directly support the soundness of the reported 20x scaling and 4:15 capability. revision: yes
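As one concrete way to read this response, the following is a hedged sketch of how drift in continuous latent codes might be quantified across a long roll-out; the metric (per-chunk deviation of latent mean and standard deviation from the first chunk) is an assumption for illustration, not the paper's definition.

```python
# Hedged sketch of quantifying latent-space drift across a long roll-out:
# track how far each chunk's latent statistics move from those of the first
# chunk. The metric choice is an assumption, not the paper's definition.
import torch

def latent_drift(latents_per_chunk):
    """latents_per_chunk: list of tensors, each shaped (batch, frames, C, H, W)."""
    ref_mean = latents_per_chunk[0].mean(dim=(1, 2, 3, 4))
    ref_std = latents_per_chunk[0].std(dim=(1, 2, 3, 4))
    drift = []
    for chunk in latents_per_chunk:
        d_mean = (chunk.mean(dim=(1, 2, 3, 4)) - ref_mean).abs()
        d_std = (chunk.std(dim=(1, 2, 3, 4)) - ref_std).abs()
        drift.append((d_mean + d_std).mean().item())
    return drift  # one value per chunk; a rising curve signals compounding error
```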
-
Referee: [Abstract and Experiments] The reported improvements in fidelity and consistency are stated without accompanying quantitative metrics (specific FVD, FID, or temporal-consistency scores with error bars and baseline deltas), segment-selection criteria, or ablation studies showing how self-guidance avoids the error accumulation it claims to solve.
Authors: We acknowledge that explicit quantitative metrics, segment-selection criteria, and targeted ablations were insufficiently detailed. The revised manuscript now reports specific FVD, FID, and temporal consistency scores with error bars and baseline deltas. We have also added the segment-selection criteria and ablation studies that isolate how self-guidance mitigates error accumulation. These updates appear in both the abstract and the results sections. revision: yes
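For illustration, a minimal sketch of reporting one such metric with error bars and a baseline delta is given below; the temporal-consistency proxy (adjacent-frame feature cosine similarity) and the 95% confidence interval are assumptions, not the paper's protocol.

```python
# Minimal sketch of reporting a temporal-consistency score with error bars and
# a baseline delta. The consistency proxy (adjacent-frame feature cosine
# similarity) and the 95% interval are assumptions, not the paper's protocol.
import torch
import torch.nn.functional as F

def temporal_consistency(frame_features):
    """frame_features: (num_frames, dim) features of one video's frames."""
    sims = F.cosine_similarity(frame_features[:-1], frame_features[1:], dim=-1)
    return sims.mean().item()

def report(scores_ours, scores_baseline):
    ours = torch.tensor(scores_ours)
    base = torch.tensor(scores_baseline)
    mean = ours.mean().item()
    err = 1.96 * ours.std().item() / len(ours) ** 0.5  # 95% CI half-width
    delta = mean - base.mean().item()
    print(f"temporal consistency: {mean:.3f} +/- {err:.3f} "
          f"(delta vs. baseline: {delta:+.3f})")
```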
-
Referee: [Results and Scaling Experiments] The 99.9% position-embedding utilization claim is load-bearing for the headline result, yet the manuscript does not detail the mechanism by which self-guidance prevents degradation at this horizon or provide controls (e.g., generation without self-segments) to isolate its contribution.
Authors: We agree that the mechanism and isolating controls should be made explicit. The revised version explains that self-guidance periodically resets latent-space drift by injecting high-fidelity self-generated segments. We have added control experiments that generate long videos without self-segments; these exhibit rapid degradation, thereby isolating the contribution of self-guidance at the 99.9% position-embedding horizon. revision: yes
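A hypothetical sketch of such a control is shown below: roll out two students, one trained with self-segment guidance and one without, to the same horizon and compare their drift curves, reusing the latent_drift helper sketched earlier; the student interface is assumed, not the authors' actual API.

```python
# Hypothetical control comparison isolating the contribution of self-guidance:
# roll out two students (trained with vs. without self-segment guidance) to the
# same horizon and compare drift curves. Reuses the latent_drift helper from
# the earlier sketch; the student interface is assumed, not the authors' API.
import torch

def compare_controls(student_with, student_without, prompt, horizon_chunks):
    curves = {}
    for name, student in [("with_self_guidance", student_with),
                          ("without_self_guidance", student_without)]:
        with torch.no_grad():
            context, chunks = None, []
            for _ in range(horizon_chunks):
                chunk, context = student.generate_chunk(prompt, context)
                chunks.append(chunk)
        curves[name] = latent_drift(chunks)  # helper defined in the earlier sketch
    return curves  # rapid growth in the "without" curve isolates self-guidance
```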
Circularity Check
No significant circularity; method is an empirical technique validated experimentally
full rationale
The paper proposes using sampled segments from self-generated long videos as guidance for the student model to extend generation length without long-video teacher data. This is framed as a practical mitigation for error compounding in autoregressive extrapolation, with performance claims (scaling to 99.9% of the position-embedding span and 50x the baseline) resting on experimental benchmarks rather than any derivation. No equations, parameter fits, or self-citations are shown that reduce the central result to the inputs by construction. The premise that self-samples remain of sufficiently high quality is an empirical assumption subject to external validation, not a self-definitional or fitted-input equivalence. The central result therefore rests on the stated experimental evidence rather than on any circular derivation.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: self-generated segments from the student model remain of sufficiently high quality to provide effective guidance equivalent to that of a bidirectional teacher.
Forward citations
Cited by 21 Pith papers
-
CausalCine: Real-Time Autoregressive Generation for Multi-Shot Video Narratives
CausalCine enables real-time causal autoregressive multi-shot video generation via multi-shot training, content-aware memory routing for coherence, and distillation to few-step inference.
-
HorizonDrive: Self-Corrective Autoregressive World Model for Long-horizon Driving Simulation
HorizonDrive enables stable long-horizon autoregressive driving simulation via anti-drifting teacher training with scheduled rollout recovery and teacher rollout distillation.
-
FreeSpec: Training-Free Long Video Generation via Singular-Spectrum Reconstruction
FreeSpec uses SVD-based spectral reconstruction to fuse global low-rank and local high-rank features, reducing content drift and preserving temporal dynamics in long video generation.
-
Stream-R1: Reliability-Perplexity Aware Reward Distillation for Streaming Video Generation
Stream-R1 improves distillation of autoregressive streaming video diffusion models by adaptively weighting supervision with a reward model at both rollout and per-pixel levels.
-
AsymTalker: Identity-Consistent Long-Term Talking Head Generation via Asymmetric Distillation
AsymTalker maintains identity consistency in long-term diffusion talking-head videos by encoding temporal references from a static image and training a student model under inference-like conditions via asymmetric dist...
-
Efficient Video Diffusion Models: Advancements and Challenges
A survey that groups efficient video diffusion methods into four paradigms—step distillation, efficient attention, model compression, and cache/trajectory optimization—and outlines open challenges for practical use.
-
Grounded Forcing: Bridging Time-Independent Semantics and Proximal Dynamics in Autoregressive Video Synthesis
Grounded Forcing introduces dual memory caching, reference-based positional embeddings, and proximity-weighted recaching to bridge stable semantics with local dynamics, improving long-range consistency in autoregressi...
-
Head Forcing: Long Autoregressive Video Generation via Head Heterogeneity
Head Forcing assigns tailored KV cache strategies to local, anchor, and memory attention heads plus head-wise RoPE re-encoding to extend autoregressive video generation from seconds to minutes without training.
-
Delta Forcing: Trust Region Steering for Interactive Autoregressive Video Generation
Delta Forcing uses latent trajectory deltas to adaptively limit unreliable teacher guidance while enforcing monotonic continuity, improving temporal consistency in interactive autoregressive video generation.
-
AsymTalker: Identity-Consistent Long-Term Talking Head Generation via Asymmetric Distillation
AsymK-Talker introduces kernel-conditioned loop generation, temporal reference encoding, and asymmetric kernel distillation to achieve real-time, drift-resistant talking head synthesis from audio using diffusion models.
-
AsymTalker: Identity-Consistent Long-Term Talking Head Generation via Asymmetric Distillation
AsymTalker uses temporal reference encoding and asymmetric knowledge distillation to produce identity-consistent talking head videos up to 600 seconds long at 66 FPS.
-
Mutual Forcing: Dual-Mode Self-Evolution for Fast Autoregressive Audio-Video Character Generation
Mutual Forcing trains a single native autoregressive audio-video model with mutually reinforcing few-step and multi-step modes via self-distillation to match 50-step baselines at 4-8 steps.
-
Memorize When Needed: Decoupled Memory Control for Spatially Consistent Long-Horizon Video Generation
A decoupled memory branch with hybrid cues, cross-attention, and gating improves spatial consistency and data efficiency in long-horizon camera-trajectory video generation.
-
Repurposing 3D Generative Model for Autoregressive Layout Generation
LaviGen turns 3D generative models into an autoregressive layout generator that models geometric and physical constraints, delivering 19% higher physical plausibility and 65% faster inference on the LayoutVLM benchmark.
-
From Synchrony to Sequence: Exo-to-Ego Generation via Interpolation
Interpolating exo and ego videos into a single continuous sequence lets diffusion sequence models generate more coherent first-person videos than direct conditioning, even without pose interpolation.
-
Long-Horizon Streaming Video Generation via Hybrid Attention with Decoupled Distillation
Hybrid Forcing combines linear temporal attention for long-range retention, block-sparse attention for efficiency, and decoupled distillation to achieve real-time unbounded 832x480 streaming video generation at 29.5 FPS.
-
WorldPlay: Towards Long-Term Geometric Consistency for Real-Time Interactive World Modeling
WorldPlay uses dual action representation, reconstituted context memory, and context forcing distillation to produce consistent 720p streaming video at 24 FPS for interactive world modeling.
-
TurboTalk: Progressive Distillation for One-Step Audio-Driven Talking Avatar Generation
TurboTalk uses progressive distillation from 4 steps to 1 step with distribution matching and adversarial training to achieve 120x faster single-step audio-driven talking avatar video generation.
-
Matrix-Game 3.0: Real-Time and Streaming Interactive World Model with Long-Horizon Memory
Matrix-Game 3.0 delivers 720p real-time video generation at 40 FPS with minute-scale memory consistency by combining residual self-correction training, camera-aware memory injection, and DMD-based autoregressive disti...
-
EchoTorrent: Towards Swift, Sustained, and Streaming Multi-Modal Video Generation
EchoTorrent combines multi-teacher distillation, adaptive CFG calibration, hybrid long-tail forcing, and VAE decoder refinement to enable few-pass autoregressive streaming video generation with improved temporal consi...
-
Evolution of Video Generative Foundations
This survey traces video generation technology from GANs to diffusion models and then to autoregressive and multimodal approaches while analyzing principles, strengths, and future trends.
discussion (0)