pith. machine review for the scientific record.

arxiv: 2504.13074 · v3 · submitted 2025-04-17 · 💻 cs.CV

SkyReels-V2: Infinite-length Film Generative Model

Boyuan Xu, Chengcheng Ma, Chunze Lin, Debang Li, Di Qiu, Dixuan Lin, Guibin Chen, Hao Zhang, Jiangping Yang, Junchen Zhu, Kang Kang, Mingyuan Fan, Nuo Pang, Peng Zhao, Sheng Chen, Weiming Xiong, Wei Wang, Yahui Zhou, Yang Li, Yubing Song, Yupeng Liang, Yuzhe Jin, Zheng Chen, Zhengcong Fei, Zhiheng Xu

Pith reviewed 2026-05-14 20:18 UTC · model grok-4.3

classification 💻 cs.CV
keywords video generation · diffusion models · infinite length video · film generation · multi-modal LLM · reinforcement learning · video captioning · shot-aware generation

The pith

SkyReels-V2 generates infinite-length films by combining language models, reinforcement learning, and diffusion forcing.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Current video generators trade off prompt adherence, visual quality, motion realism, and length, typically limiting outputs to short clips of five to ten seconds. SkyReels-V2 tackles these limits through a structural video representation built from a multi-modal LLM and sub-expert models for shot details, followed by progressive pretraining and a four-stage post-training sequence. The stages include concept-balanced supervised fine-tuning, motion-specific reinforcement learning on annotated and synthetic data, a diffusion forcing framework with non-decreasing noise schedules for long synthesis, and final high-quality fine-tuning. This integrated pipeline is presented as the route to arbitrary-duration film generation that keeps all four qualities in balance.

Core claim

SkyReels-V2 synergizes a Multi-modal Large Language Model, multi-stage pretraining, reinforcement learning, and a diffusion forcing framework to produce an infinite-length film generative model that maintains prompt adherence, visual quality, motion dynamics, and duration without the usual compromises.

What carries the argument

The diffusion forcing framework with non-decreasing noise schedules, which enables long-video synthesis inside an efficient search space while avoiding new artifacts.
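
The abstract does not spell out the schedule, so the following is only a minimal sketch of what a non-decreasing constraint could look like and why it shrinks the search space. It assumes each frame receives one discrete noise level and that "search space" means the number of possible per-frame level assignments. The function names, the frame count, and the number of levels are all hypothetical and not taken from the paper.

```python
import math
import random


def sample_non_decreasing_levels(num_frames, num_levels, seed=0):
    """Draw one noise level per frame, then sort so levels never drop with frame index.

    Sorting independent draws is just a simple way to get a valid schedule;
    it is not claimed to match the sampler used in the paper.
    """
    rng = random.Random(seed)
    return sorted(rng.randrange(num_levels) for _ in range(num_frames))


def is_non_decreasing(levels):
    """Check that every frame is at least as noisy as the frame before it."""
    return all(a <= b for a, b in zip(levels, levels[1:]))


def count_unconstrained(num_frames, num_levels):
    """Number of schedules when every frame picks its level independently."""
    return num_levels ** num_frames


def count_non_decreasing(num_frames, num_levels):
    """Number of non-decreasing schedules (multisets of size num_frames)."""
    return math.comb(num_levels + num_frames - 1, num_frames)


if __name__ == "__main__":
    frames, levels = 8, 1000  # hypothetical sizes, chosen for illustration
    schedule = sample_non_decreasing_levels(frames, levels)
    print(schedule, is_non_decreasing(schedule))
    print(count_unconstrained(frames, levels) // count_non_decreasing(frames, levels))
```

Under these assumptions the constrained count is a binomial coefficient rather than a power, which is one way to read the claim that the schedule keeps synthesis inside an efficient search space.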

If this is right

  • Videos can be produced at any length while preserving both resolution and realistic motion.
  • Cinematic grammar such as shot composition, actor expressions, and camera motions is captured through the unified Video Captioner and expert models.
  • Motion-specific reinforcement learning reduces dynamic artifacts that appear in extended sequences.
  • The final high-quality supervised fine-tuning step refines visual fidelity for film-like results across durations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same staged training pattern could be tested on narrative consistency across multi-minute character arcs.
  • Adding further expert models for elements like lighting or editing might extend the method beyond current shot-level representation.
  • Deployment on consumer hardware would test whether the efficient search space keeps inference costs practical for long outputs.

Load-bearing premise

The diffusion forcing framework with non-decreasing noise schedules can extend video generation to arbitrary lengths without introducing artifacts or causing quality collapse.

What would settle it

Generate 60-second videos with SkyReels-V2 and measure whether visual quality scores and motion consistency remain comparable to its 5-10 second outputs; sustained degradation would disprove the central claim.
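
A minimal sketch of how that check could be scored, assuming a per-second quality score is already available for each generated video. The scoring metric, the window size, and the tolerance are hypothetical choices; none of them come from the paper.

```python
from statistics import mean


def quality_drift(scores, window):
    """Mean score of the last `window` seconds minus the mean of the first `window`.

    A negative value means the end of the video scores lower than the start.
    """
    if window <= 0 or len(scores) < 2 * window:
        raise ValueError("need at least two non-overlapping windows of scores")
    return mean(scores[-window:]) - mean(scores[:window])


def shows_sustained_degradation(scores, window, tolerance):
    """Flag a video whose late-window quality falls below the early window by more than `tolerance`."""
    return quality_drift(scores, window) < -tolerance


if __name__ == "__main__":
    # Hypothetical per-second scores for a 60-second clip: stable, then a late drop.
    scores = [0.80] * 50 + [0.60] * 10
    print(quality_drift(scores, window=10))
    print(shows_sustained_degradation(scores, window=10, tolerance=0.05))
```

Comparing the first window against the last keeps the test tied to the 5-10 second regime the paper treats as its baseline, while the tolerance absorbs ordinary metric noise.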

Original abstract

Recent advances in video generation have been driven by diffusion models and autoregressive frameworks, yet critical challenges persist in harmonizing prompt adherence, visual quality, motion dynamics, and duration: compromises in motion dynamics to enhance temporal visual quality, constrained video duration (5-10 seconds) to prioritize resolution, and inadequate shot-aware generation stemming from general-purpose MLLMs' inability to interpret cinematic grammar, such as shot composition, actor expressions, and camera motions. These intertwined limitations hinder realistic long-form synthesis and professional film-style generation. To address these limitations, we propose SkyReels-V2, an Infinite-length Film Generative Model, that synergizes Multi-modal Large Language Model (MLLM), Multi-stage Pretraining, Reinforcement Learning, and Diffusion Forcing Framework. Firstly, we design a comprehensive structural representation of video that combines the general descriptions by the Multi-modal LLM and the detailed shot language by sub-expert models. Aided with human annotation, we then train a unified Video Captioner, named SkyCaptioner-V1, to efficiently label the video data. Secondly, we establish progressive-resolution pretraining for the fundamental video generation, followed by a four-stage post-training enhancement: Initial concept-balanced Supervised Fine-Tuning (SFT) improves baseline quality; Motion-specific Reinforcement Learning (RL) training with human-annotated and synthetic distortion data addresses dynamic artifacts; Our diffusion forcing framework with non-decreasing noise schedules enables long-video synthesis in an efficient search space; Final high-quality SFT refines visual fidelity. All the code and models are available at https://github.com/SkyworkAI/SkyReels-V2.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript presents SkyReels-V2, an infinite-length film generative model that integrates a Multi-modal Large Language Model (MLLM) for structural video representation and captioning via SkyCaptioner-V1, multi-stage pretraining with progressive resolution, reinforcement learning for motion-specific refinement using human-annotated and synthetic data, and a diffusion forcing framework with non-decreasing noise schedules to achieve harmonized prompt adherence, visual quality, motion dynamics, and extended duration beyond the typical 5-10 second limit of prior models.

Significance. If the empirical claims hold, the work could advance long-form video generation toward professional film-style output by addressing intertwined limitations in duration, motion quality, and cinematic control that current diffusion and autoregressive approaches face.

major comments (2)
  1. [Abstract] The central claims of superior prompt adherence, visual quality, motion dynamics, and infinite-length synthesis are presented without any quantitative metrics, ablation studies, or direct comparisons to baselines, leaving the effectiveness of the four-stage post-training pipeline and the diffusion forcing framework unsupported by visible evidence.
  2. [Abstract] The diffusion forcing framework with non-decreasing noise schedules is asserted to enable long-video synthesis in an efficient search space without new artifacts or progressive quality collapse, yet no length-scaling experiments, temporal consistency metrics (e.g., as duration grows from tens of seconds to minutes), or comparisons against standard decreasing-noise baselines are reported.
minor comments (1)
  1. [Abstract] The description of 'shot language' and 'cinematic grammar' (shot composition, actor expressions, camera motions) would benefit from explicit definitions or examples to clarify how sub-expert models differ from general-purpose MLLMs.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the abstract. We address the two major comments point by point below. We agree that the abstract would be strengthened by incorporating key quantitative highlights and will revise it in the next version while preserving its concise nature.

Point-by-point responses
  1. Referee: [Abstract] The central claims of superior prompt adherence, visual quality, motion dynamics, and infinite-length synthesis are presented without any quantitative metrics, ablation studies, or direct comparisons to baselines, leaving the effectiveness of the four-stage post-training pipeline and the diffusion forcing framework unsupported by visible evidence.

    Authors: We acknowledge that the abstract emphasizes the methodological pipeline without numerical results. The full manuscript (Sections 4 and 5) reports quantitative evaluations including CLIP-based prompt adherence scores, FID/FVD for visual quality, motion-specific metrics, ablation studies on each post-training stage, and direct comparisons against baselines such as CogVideoX and other long-video models. To address the concern, we will revise the abstract to include the primary quantitative gains (e.g., relative improvements in long-form consistency and quality metrics) while keeping the length appropriate.

    revision: yes

  2. Referee: [Abstract] The diffusion forcing framework with non-decreasing noise schedules is asserted to enable long-video synthesis in an efficient search space without new artifacts or progressive quality collapse, yet no length-scaling experiments, temporal consistency metrics (e.g., as duration grows from tens of seconds to minutes), or comparisons against standard decreasing-noise baselines are reported.

    Authors: The manuscript contains length-scaling experiments and temporal consistency analysis (Section 4.3) that evaluate performance from 10-second clips to multi-minute sequences, reporting metrics such as frame-wise consistency and artifact rates under the non-decreasing schedule versus standard decreasing-noise baselines. These results show maintained quality without progressive collapse. We will add a brief summary of the scaling behavior and key consistency metrics to the abstract to make this evidence visible at the outset.

    revision: yes

Circularity Check

0 steps flagged

No significant circularity; claims rest on described engineering integration rather than self-referential reduction

Full rationale

The manuscript presents SkyReels-V2 as a composite system integrating MLLM-based captioning, progressive pretraining, RL stages, and a diffusion forcing framework. No equations are supplied that equate any claimed output (e.g., infinite-length coherence) to quantities defined by the same fitted parameters or by construction. The diffusion-forcing description is stated at the architectural level without a mathematical derivation that collapses to the input schedule. No self-citation is invoked as a uniqueness theorem or load-bearing premise, and no ansatz is smuggled via prior work. The central claims therefore remain additive engineering assertions rather than tautological re-statements of inputs.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the effectiveness of the diffusion forcing stage and the assumption that human-annotated plus synthetic distortion data suffice to train motion quality via RL. No new physical entities are postulated.

free parameters (1)
  • non-decreasing noise schedule parameters
    Chosen to enable long-sequence generation; specific values not detailed in abstract.
axioms (1)
  • domain assumption: General-purpose MLLMs require sub-expert models to interpret cinematic grammar such as shot composition and camera motions.
    Invoked to justify the structural video representation and SkyCaptioner-V1 training.

pith-pipeline@v0.9.0 · 5679 in / 1259 out tokens · 25687 ms · 2026-05-14T20:18:50.901616+00:00 · methodology

Forward citations

Cited by 25 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. AnyFlow: Any-Step Video Diffusion Model with On-Policy Flow Map Distillation

    cs.CV 2026-05 unverdicted novelty 8.0

    AnyFlow enables any-step video diffusion by distilling flow-map transitions over arbitrary time intervals with on-policy backward simulation.

  2. Flow-GRPO: Training Flow Matching Models via Online RL

    cs.CV 2025-05 unverdicted novelty 8.0

    Flow-GRPO is the first online RL method for flow matching models, raising GenEval accuracy from 63% to 95% and text-rendering accuracy from 59% to 92% with little reward hacking.

  3. FreeSpec: Training-Free Long Video Generation via Singular-Spectrum Reconstruction

    cs.CV 2026-05 unverdicted novelty 7.0

    FreeSpec uses SVD-based spectral reconstruction to fuse global low-rank and local high-rank features, reducing content drift and preserving temporal dynamics in long video generation.

  4. Stream-R1: Reliability-Perplexity Aware Reward Distillation for Streaming Video Generation

    cs.CV 2026-05 unverdicted novelty 7.0

    Stream-R1 improves distillation of autoregressive streaming video diffusion models by adaptively weighting supervision with a reward model at both rollout and per-pixel levels.

  5. AniMatrix: An Anime Video Generation Model that Thinks in Art, Not Physics

    cs.CV 2026-05 unverdicted novelty 7.0

    AniMatrix generates anime videos using a production knowledge taxonomy, dual-channel conditioning, style-motion curriculum, and deformation-aware preference optimization, outperforming baselines in animator evaluation...

  6. AniMatrix: An Anime Video Generation Model that Thinks in Art, Not Physics

    cs.CV 2026-05 unverdicted novelty 7.0

    AniMatrix generates anime videos by structuring artistic production rules into a controllable taxonomy and training the model to prioritize those rules over physical realism, achieving top scores from professional ani...

  7. AniMatrix: An Anime Video Generation Model that Thinks in Art, Not Physics

    cs.CV 2026-05 unverdicted novelty 7.0

    AniMatrix generates anime videos using a structured taxonomy of artistic production variables, dual-channel conditioning, a style-motion curriculum, and deformation-aware optimization to prioritize art over physics.

  8. Sparse Forcing: Native Trainable Sparse Attention for Real-time Autoregressive Diffusion Video Generation

    cs.CV 2026-04 unverdicted novelty 7.0

    Sparse Forcing adds a native trainable sparsity mechanism and PBSA kernel to autoregressive diffusion video models, yielding higher VBench scores and 1.1-1.27x speedups on 5s to 1min generations.

  9. Efficient Video Diffusion Models: Advancements and Challenges

    cs.CV 2026-04 unverdicted novelty 7.0

    A survey that groups efficient video diffusion methods into four paradigms—step distillation, efficient attention, model compression, and cache/trajectory optimization—and outlines open challenges for practical use.

  10. Grounded Forcing: Bridging Time-Independent Semantics and Proximal Dynamics in Autoregressive Video Synthesis

    cs.CV 2026-04 unverdicted novelty 7.0

    Grounded Forcing introduces dual memory caching, reference-based positional embeddings, and proximity-weighted recaching to bridge stable semantics with local dynamics, improving long-range consistency in autoregressi...

  11. Not All Frames Deserve Full Computation: Accelerating Autoregressive Video Generation via Selective Computation and Predictive Extrapolation

    cs.CV 2026-04 conditional novelty 7.0

    SCOPE accelerates autoregressive video diffusion up to 4.73x by using a tri-modal cache-predict-recompute scheduler with Taylor extrapolation and selective active-frame computation while preserving output quality.

  12. Head Forcing: Long Autoregressive Video Generation via Head Heterogeneity

    cs.CV 2026-05 unverdicted novelty 6.0

    Head Forcing assigns tailored KV cache strategies to local, anchor, and memory attention heads plus head-wise RoPE re-encoding to extend autoregressive video generation from seconds to minutes without training.

  13. Delta Forcing: Trust Region Steering for Interactive Autoregressive Video Generation

    cs.CV 2026-05 unverdicted novelty 6.0

    Delta Forcing uses latent trajectory deltas to adaptively limit unreliable teacher guidance while enforcing monotonic continuity, improving temporal consistency in interactive autoregressive video generation.

  14. Forcing-KV: Hybrid KV Cache Compression for Efficient Autoregressive Video Diffusion Models

    cs.CV 2026-05 unverdicted novelty 6.0

    Forcing-KV applies head-specific static and dynamic pruning to KV caches in AR video diffusion models, achieving over 29 fps, 30% memory reduction, and up to 2.82x speedup at maintained quality.

  15. Stream-T1: Test-Time Scaling for Streaming Video Generation

    cs.CV 2026-05 unverdicted novelty 6.0

    Stream-T1 is a test-time scaling framework for streaming video generation using scaled noise propagation from history, reward pruning across short and long windows, and feedback-guided memory sinking to improve tempor...

  16. Mutual Forcing: Dual-Mode Self-Evolution for Fast Autoregressive Audio-Video Character Generation

    cs.CV 2026-04 unverdicted novelty 6.0

    Mutual Forcing trains a single native autoregressive audio-video model with mutually reinforcing few-step and multi-step modes via self-distillation to match 50-step baselines at 4-8 steps.

  17. Building a Precise Video Language with Human-AI Oversight

    cs.CV 2026-04 unverdicted novelty 6.0

    CHAI framework pairs AI pre-captions with expert human critiques to produce precise video descriptions, enabling open models to outperform closed ones like Gemini-3.1-Pro and improve fine-grained control in video gene...

  18. TS-Attn: Temporal-wise Separable Attention for Multi-Event Video Generation

    cs.CV 2026-04 unverdicted novelty 6.0

    TS-Attn dynamically separates and rearranges attention in existing text-to-video models to improve temporal consistency and prompt adherence for videos with multiple sequential actions.

  19. Memorize When Needed: Decoupled Memory Control for Spatially Consistent Long-Horizon Video Generation

    cs.CV 2026-04 unverdicted novelty 6.0

    A decoupled memory branch with hybrid cues, cross-attention, and gating improves spatial consistency and data efficiency in long-horizon camera-trajectory video generation.

  20. Repurposing 3D Generative Model for Autoregressive Layout Generation

    cs.CV 2026-04 unverdicted novelty 6.0

    LaviGen turns 3D generative models into an autoregressive layout generator that models geometric and physical constraints, delivering 19% higher physical plausibility and 65% faster inference on the LayoutVLM benchmark.

  21. Long-Horizon Streaming Video Generation via Hybrid Attention with Decoupled Distillation

    cs.CV 2026-04 conditional novelty 6.0

    Hybrid Forcing combines linear temporal attention for long-range retention, block-sparse attention for efficiency, and decoupled distillation to achieve real-time unbounded 832x480 streaming video generation at 29.5 FPS.

  22. Salt: Self-Consistent Distribution Matching with Cache-Aware Training for Fast Video Generation

    cs.CV 2026-04 unverdicted novelty 6.0

    Salt improves low-step video generation quality by adding endpoint-consistent regularization to distribution matching distillation and using cache-conditioned feature alignment for autoregressive models.

  23. Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion

    cs.CV 2025-06 unverdicted novelty 6.0

    Self Forcing trains autoregressive video diffusion models by performing autoregressive rollout with KV caching during training to close the exposure bias gap, using a holistic video-level loss and few-step diffusion f...

  24. Motion-Aware Caching for Efficient Autoregressive Video Generation

    cs.CV 2026-05 unverdicted novelty 5.0

    MotionCache speeds up autoregressive video generation by 6.28x on SkyReels-V2 and 1.64x on MAGI-1 via motion-weighted cache reuse based on inter-frame differences, with negligible quality loss on VBench.

  25. Motif-Video 2B: Technical Report

    cs.CV 2026-04 unverdicted novelty 5.0

    Motif-Video 2B achieves 83.76% VBench score, beating a 14B-parameter baseline with 7x fewer parameters and substantially less training data through shared cross-attention and a three-part backbone.

Reference graph

Works this paper leans on

77 extracted references · 77 canonical work pages · cited by 23 Pith papers · 11 internal anchors

  1. [1]

    Video generation models as world simulators, 2024

    OpenAI. Video generation models as world simulators, 2024

  2. [2]

    Kling, 2024

    Kuaishou. Kling, 2024

  3. [3]

    Hailuo, 2024

    MiniMax. Hailuo, 2024

  4. [4]

    Veo 2, 2024

    DeepMind. Veo 2, 2024

  5. [5]

    Wan: Open and Advanced Large-Scale Video Generative Models

    WanTeam, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, Jianyuan Zeng, Jiayu Wang, Jingfeng Zhang, Jingren Zhou, Jinkai Wang, Jixuan Chen, Kai Zhu, Kang Zhao, Keyu Yan, Lianghua Huang, Mengyang Feng, Ningyi Zhang, Pandeng Li, Pingyu Wu, Ruihang Chu, Ruili Feng, Shiwei Zhang, Siyang Sun, Tao Fang, Ti...

  6. [6]

    VBench: Comprehensive benchmark suite for video generative models

    Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, Yaohui Wang, Xinyuan Chen, Limin Wang, Dahua Lin, Yu Qiao, and Ziwei Liu. VBench: Comprehensive benchmark suite for video generative models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognit...

  7. [7]

    Videojam: Joint appearance-motion representations for enhanced motion generation in video models, 2025

    Hila Chefer, Uriel Singer, Amit Zohar, Yuval Kirstain, Adam Polyak, Yaniv Taigman, Lior Wolf, and Shelly Sheynin. Videojam: Joint appearance-motion representations for enhanced motion generation in video models, 2025

  8. [8]

    Qwen2.5-VL Technical Report

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-vl technical report. a...

  9. [9]

    Gpt-o1, 2024

    OpenAI. Gpt-o1, 2024

  10. [10]

    Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning, 2025

    DeepSeek-AI. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning, 2025

  11. [11]

    Ar-diffusion: Asynchronous video generation with auto-regressive diffusion

    Mingzhen Sun, Weining Wang, Gen Li, Jiawei Liu, Jiahui Sun, Wanquan Feng, Shanshan Lao, SiYu Zhou, Qian He, and Jing Liu. Ar-diffusion: Asynchronous video generation with auto-regressive diffusion. arXiv preprint arXiv:2503.07418, 2025

  12. [12]

    Gen-4, 2024

    RunwayML. Gen-4, 2024

  13. [13]

    Make-A-Video: Text-to-Video Generation without Text-Video Data

    Uriel Singer, Adam Polyak, Thomas Hayes, Xi Yin, Jie An, Songyang Zhang, Qiyuan Hu, Harry Yang, Oron Ashual, Oran Gafni, et al. Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:2209.14792, 2022

  14. [14]

    Animatediff: Animate your personalized text-to-image diffusion models without specific tuning, 2023

    Yuwei Guo, Ceyuan Yang, Anyi Rao, Zhengyang Liang, Yaohui Wang, Yu Qiao, Maneesh Agrawala, Dahua Lin, and Bo Dai. Animatediff: Animate your personalized text-to-image diffusion models without specific tuning, 2023

  15. [15]

    Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

    Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127, 2023

  16. [16]

    Video diffusion models

    Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J Fleet. Video diffusion models. Advances in Neural Information Processing Systems, 35:8633–8646, 2022

  17. [17]

    CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

    Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. Cogvideox: Text-to-video diffusion models with an expert transformer. arXiv preprint arXiv:2408.06072, 2024

  18. [18]

    Hunyuanvideo: A systematic framework for large video generative models, 2025

    Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, Kathrina Wu, Qin Lin, Junkun Yuan, Yanxin Long, Aladdin Wang, Andong Wang, Changlin Li, Duojun Huang, Fang Yang, Hao Tan, Hongmei Wang, Jacob Song, Jiawang Bai, Jianbing Wu, Jinbao Xue, Joey Wang, Kai Wang, Mengyang Liu, Pengyu Li, Shuai Li, ...

  19. [19]

    Step-video-t2v technical report: The practice, challenges, and future of video foundation model, 2025

    Guoqing Ma, Haoyang Huang, Kun Yan, Liangyu Chen, Nan Duan, Shengming Yin, Changyi Wan, Ranchen Ming, Xiaoniu Song, Xing Chen, Yu Zhou, Deshan Sun, Deyu Zhou, Jian Zhou, Kaijun Tan, Kang An, Mei Chen, Wei Ji, Qiling Wu, Wen Sun, Xin Han, Yanan Wei, Zheng Ge, Aojie Li, Bin Wang, Bizhu Huang, Bo Wang, Brian Li, Changxing Miao, Chen Xu, Chenfei Wu, Chenguang...

  20. [20]

    Skyreels-v1, 2025

    SkyworkAI. Skyreels-v1, 2025

  21. [22]

    U-net: Convolutional networks for biomedical image segmentation

    Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In Medical image computing and computer-assisted intervention–MICCAI 2015: 18th international conference, Munich, Germany, October 5-9, 2015, proceedings, part III 18, pages 234–241. Springer, 2015

  22. [23]

    Scalable Diffusion Models with Transformers

    William Peebles and Saining Xie. Scalable diffusion models with transformers. arXiv preprint arXiv:2212.09748, 2022

  23. [24]

    Scaling rectified flow transformers for high-resolution image synthesis

    Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In Forty-first international conference on machine learning, 2024

  24. [25]

    Taming transformers for high-resolution image synthesis

    Patrick Esser, Robin Rombach, and Bjorn Ommer. Taming transformers for high-resolution image synthesis. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12873–12883, 2021

  25. [26]

    Open-Sora: Democratizing Efficient Video Production for All

    Zangwei Zheng, Xiangyu Peng, Tianji Yang, Chenhui Shen, Shenggui Li, Hongxin Liu, Yukun Zhou, Tianyi Li, and Yang You. Open-sora: Democratizing efficient video production for all. arXiv preprint arXiv:2412.20404, 2024

  26. [27]

    Wf-vae: Enhancing video vae by wavelet-driven energy flow for latent video diffusion model

    Zongjian Li, Bin Lin, Yang Ye, Liuhan Chen, Xinhua Cheng, Shenghai Yuan, and Li Yuan. Wf-vae: Enhancing video vae by wavelet-driven energy flow for latent video diffusion model. arXiv preprint arXiv:2411.17459, 2024

  27. [28]

    Cv-vae: A compatible video vae for latent generative video models

    Sijie Zhao, Yong Zhang, Xiaodong Cun, Shaoshu Yang, Muyao Niu, Xiaoyu Li, Wenbo Hu, and Ying Shan. Cv-vae: A compatible video vae for latent generative video models. https://arxiv.org/abs/2405.20279, 2024

  28. [29]

    NVIDIA et al. Cosmos world foundation model platform for physical ai. arXiv preprint arXiv:2501.03575, 2025

  29. [30]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021

  30. [31]

    Exploring the limits of transfer learning with a unified text-to-text transformer

    Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of machine learning research, 21(140):1–67, 2020

  31. [32]

    Unimax: Fairer and more effective language sampling for large-scale multilingual pretraining

    Hyung Won Chung, Noah Constant, Xavier Garcia, Adam Roberts, Yi Tay, Sharan Narang, and Orhan Firat. Unimax: Fairer and more effective language sampling for large-scale multilingual pretraining. arXiv preprint arXiv:2304.09151, 2023

  32. [33]

    Denoising diffusion probabilistic models

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020

  33. [34]

    High-resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10684–10695, June 2022

  34. [35]

    Flow Matching for Generative Modeling

    Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747, 2022

  35. [36]

    Gpt-4o, 2024

    OpenAI. Gpt-4o, 2024

  36. [37]

    Gemini 2.5 pro, 2025

    DeepMind. Gemini 2.5 pro, 2025

  37. [38]

    Tarsier2: Advancing large vision-language models from detailed video description to comprehensive video understanding, 2025

    Liping Yuan, Jiawei Wang, Haomiao Sun, Yuchen Zhang, and Yuan Lin. Tarsier2: Advancing large vision-language models from detailed video description to comprehensive video understanding, 2025

  38. [39]

    Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedback, 2022

  39. [40]

    Advantage-weighted regression: Simple and scalable off-policy reinforcement learning

    Xue Bin Peng, Aviral Kumar, Grace Zhang, and Sergey Levine. Advantage-weighted regression: Simple and scalable off-policy reinforcement learning. 2021

  40. [41]

    Aligning text-to-image diffusion models without human feedback

    Tao Liu, Huafeng Kuang, and Xianming Lin. Aligning text-to-image diffusion models without human feedback. In ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

  41. [42]

    Diffusion model alignment using direct preference optimization

    Bram Wallace, Meihua Dang, Rafael Rafailov, Linqi Zhou, Aaron Lou, Senthil Purushwalkam, Stefano Ermon, Caiming Xiong, Shafiq Joty, and Nikhil Naik. Diffusion model alignment using direct preference optimization. IEEE, 2023

  42. [43]

    Using human feedback to fine-tune diffusion models without any reward model

    Kai Yang, Jian Tao, Jiafei Lyu, Chunjiang Ge, Jiaxin Chen, Qimai Li, Weihan Shen, Xiaolong Zhu, and Xiu Li. Using human feedback to fine-tune diffusion models without any reward model. IEEE, 2023

  43. [44]

    Step-aware preference optimization: Aligning preference with denoising performance at each step

    Zhanhao Liang, Yuhui Yuan, Shuyang Gu, Bohan Chen, Tiankai Hang, Ji Li, and Liang Zheng. Step-aware preference optimization: Aligning preference with denoising performance at each step. 2024

  44. [45]

    Videodpo: Omni-preference alignment for video diffusion generation, 2024

    Runtao Liu, Haoyu Wu, Zheng Ziqiang, Chen Wei, Yingqing He, Renjie Pi, and Qifeng Chen. Videodpo: Omni-preference alignment for video diffusion generation, 2024

  45. [46]

    Improving Video Generation with Human Feedback

    Jie Liu, Gongye Liu, Jiajun Liang, Ziyang Yuan, Xiaokun Liu, Mingwu Zheng, Xiele Wu, Qiulin Wang, Wenyu Qin, Menghan Xia, Xintao Wang, Xiaohong Liu, Fei Yang, Pengfei Wan, Di Zhang, Kun Gai, Yujiu Yang, and Wanli Ouyang. Improving video generation with human feedback. arXiv preprint arXiv:2501.13918, 2025

  46. [47]

    Musiq: Multi-scale image quality transformer

    Junjie Ke, Qifei Wang, Yilin Wang, Peyman Milanfar, and Feng Yang. Musiq: Multi-scale image quality transformer. 2021

  47. [48]

    Unified reward model for multimodal understanding and generation, 2025

    Yibin Wang, Yuhang Zang, Hao Li, Cheng Jin, and Jiaqi Wang. Unified reward model for multimodal understanding and generation, 2025

  48. [49]

    Mj-video: Fine-grained benchmarking and rewarding video preferences in video generation, 2025

    Haibo Tong, Zhaoyang Wang, Zhaorun Chen, Haonian Ji, Shi Qiu, Siwei Han, Kexin Geng, Zhongkai Xue, Yiyang Zhou, Peng Xia, Mingyu Ding, Rafael Rafailov, Chelsea Finn, and Huaxiu Yao. Mj-video: Fine-grained benchmarking and rewarding video preferences in video generation, 2025

  49. [50]

    Diffusion forcing: Next-token prediction meets full-sequence diffusion, 2024

    Boyuan Chen, Diego Marti Monso, Yilun Du, Max Simchowitz, Russ Tedrake, and Vincent Sitzmann. Diffusion forcing: Next-token prediction meets full-sequence diffusion, 2024

  50. [51]

    Ar-diffusion: Asynchronous video generation with auto-regressive diffusion, 2025

    Mingzhen Sun, Weining Wang, Gen Li, Jiawei Liu, Jiahui Sun, Wanquan Feng, Shanshan Lao, SiYu Zhou, Qian He, and Jing Liu. Ar-diffusion: Asynchronous video generation with auto-regressive diffusion, 2025

  51. [52]

    History-guided video diffusion, 2025

    Kiwhan Song, Boyuan Chen, Max Simchowitz, Yilun Du, Russ Tedrake, and Vincent Sitzmann. History-guided video diffusion, 2025

  52. [53]

    Classifier-free diffusion guidance, 2022

    Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance, 2022

  53. [54]

    From slow bidirectional to fast autoregressive video diffusion models, 2025

    Tianwei Yin, Qiang Zhang, Richard Zhang, William T. Freeman, Fredo Durand, Eli Shechtman, and Xun Huang. From slow bidirectional to fast autoregressive video diffusion models, 2025

  54. [55]

    One-step diffusion with distribution matching distillation, 2024

    Tianwei Yin, Michaël Gharbi, Richard Zhang, Eli Shechtman, Fredo Durand, William T. Freeman, and Taesung Park. One-step diffusion with distribution matching distillation, 2024

  55. [56]

    Tianwei Yin, Michaël Gharbi, Taesung Park, Richard Zhang, Eli Shechtman, Fredo Durand, and William T. Freeman. Improved distribution matching distillation for fast image synthesis, 2024

  56. [57]

    Long context tuning for video generation, 2025

    Yuwei Guo, Ceyuan Yang, Ziyan Yang, Zhibei Ma, Zhijie Lin, Zhenheng Yang, Dahua Lin, and Lu Jiang. Long context tuning for video generation, 2025

  57. [58]

    Koala-36m: A large-scale video dataset improving consistency between fine-grained conditions and video content, 2024

    Qiuheng Wang, Yukai Shi, Jiarong Ou, Rui Chen, Ke Lin, Jiahao Wang, Boyuan Jiang, Haotian Yang, Mingwu Zheng, Xin Tao, Fei Yang, Pengfei Wan, and Di Zhang. Koala-36m: A large-scale video dataset improving consistency between fine-grained conditions and video content, 2024

  58. [59]

    Humanvid: Demystifying training data for camera-controllable human image animation

    Zhenzhi Wang, Yixuan Li, Yanhong Zeng, Youqing Fang, Yuwei Guo, Wenran Liu, Jing Tan, Kai Chen, Tianfan Xue, Bo Dai, and Dahua Lin. Humanvid: Demystifying training data for camera-controllable human image animation. In NeurIPS, 2024

  59. [60]

    Transnet v2: An effective deep network architecture for fast shot transition detection

    Tomáš Souček and Jakub Lokoč. Transnet v2: An effective deep network architecture for fast shot transition detection. arXiv preprint arXiv:2008.04838, 2020

  60. [61]

    aesthetic-predictor-v2-5

    Verb. aesthetic-predictor-v2-5. https://github.com/discus0434/aesthetic-predictor-v2-5, 2024. Accessed: 2024.11.12

  61. [62]

    A self-supervised descriptor for image copy detection

    Ed Pizzi, Sreya Dutta Roy, Sugosh Nagavara Ravindra, Priya Goyal, and Matthijs Douze. A self-supervised descriptor for image copy detection. Proc. CVPR, 2022

  62. [63]

    Fast-vqa: Efficient end-to-end video quality assessment with fragment sampling

    Haoning Wu, Chaofeng Chen, Jingwen Hou, Liang Liao, Annan Wang, Wenxiu Sun, Qiong Yan, and Weisi Lin. Fast-vqa: Efficient end-to-end video quality assessment with fragment sampling. In Proceedings of the European Conference on Computer Vision (ECCV), 2022

  63. [64]

    Exploring video quality assessment on user generated contents from aesthetic and technical perspectives

    Haoning Wu, Erli Zhang, Liang Liao, Chaofeng Chen, Jingwen Hou, Annan Wang, Wenxiu Sun, Qiong Yan, and Weisi Lin. Exploring video quality assessment on user generated contents from aesthetic and technical perspectives. In International Conference on Computer Vision (ICCV), 2023

  64. [65]

    Open source deep end-to-end video quality assessment toolbox, 2022

    Haoning Wu. Open source deep end-to-end video quality assessment toolbox, 2022

  65. [66]

    Arniqa: Learning distortion manifold for image quality assessment

    Lorenzo Agnolucci, Leonardo Galteri, Marco Bertini, and Alberto Del Bimbo. Arniqa: Learning distortion manifold for image quality assessment. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 189–198, 2024

  66. [67]

    Character region awareness for text detection

    Youngmin Baek, Bado Lee, Dongyoon Han, Sangdoo Yun, and Hwalsuk Lee. Character region awareness for text detection. In Proceedings of the IEEE/CVF conference on Computer Vision and Pattern Recognition (CVPR), pages 9365–9374, 2019

  67. [68]

    MiniCPM-V: A GPT-4V Level MLLM on Your Phone

    Yuan Yao, Tianyu Yu, Ao Zhang, Chongyi Wang, Junbo Cui, Hongji Zhu, Tianchi Cai, Haoyu Li, Weilin Zhao, Zhihui He, et al. Minicpm-v: A gpt-4v level mllm on your phone. arXiv preprint arXiv:2408.01800, 2024

  68. [69]

    From static to dynamic: Adapting landmark-aware image models for facial expression recognition in videos

    Yin Chen, Jia Li, Shiguang Shan, Meng Wang, and Richang Hong. From static to dynamic: Adapting landmark-aware image models for facial expression recognition in videos. IEEE Transactions on Affective Computing, pages 1–15, 2024

  69. [70]

    Timestep embedding tells: It’s time to cache for video diffusion model

    Feng Liu, Shiwei Zhang, Xiaofeng Wang, Yujie Wei, Haonan Qiu, Yuzhong Zhao, Yingya Zhang, Qixiang Ye, and Fang Wan. Timestep embedding tells: It’s time to cache for video diffusion model. arXiv preprint arXiv:2411.19108, 2024

  70. [71]

    Ties in paired-comparison experiments: A generalization of the bradley-terry model

    Pejaver V Rao and Lawrence L Kupper. Ties in paired-comparison experiments: A generalization of the bradley-terry model. Journal of the American Statistical Association, 62(317):194–204, 1967

  71. [72]

    Deepspeed ulysses: System optimizations for enabling training of extreme long sequence transformer models, 2023

    Sam Ade Jacobs, Masahiro Tanaka, Chengming Zhang, Minjia Zhang, Shuaiwen Leon Song, Samyam Rajbhandari, and Yuxiong He. Deepspeed ulysses: System optimizations for enabling training of extreme long sequence transformer models, 2023

  72. [73]

    Sageattention2: Efficient attention with thorough outlier smoothing and per-thread int4 quantization, 2025

    Jintao Zhang, Haofeng Huang, Pengle Zhang, Jia Wei, Jun Zhu, and Jianfei Chen. Sageattention2: Efficient attention with thorough outlier smoothing and per-thread int4 quantization, 2025

  73. [74]

    Gen-3 alpha, 2025

    RunwayML. Gen-3 alpha, 2025

  74. [75]

    Open-sora 2.0: Training a commercial-level video generation model in $200k

    Xiangyu Peng, Zangwei Zheng, Chenhui Shen, Tom Young, Xinying Guo, Binluo Wang, Hang Xu, Hongxin Liu, Mingyan Jiang, Wenjun Li, Yuhui Wang, Anbang Ye, Gang Ren, Qianran Ma, Wanying Liang, Xiang Lian, Xiwen Wu, Yuting Zhong, Zhuangyan Li, Chaoyu Gong, Guojun Lei, Leijun Cheng, Limin Zhang, Minghao Li, Ruijie Zhang, Silan Hu, Shijie Huang, Xiaokang Wang, Yu...

  75. [76]

    Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer, 2019

  76. [77]

    Skyreels-a2: Compose anything in video diffusion transformers

    Zhengcong Fei, Debang Li, Di Qiu, Jiahua Wang, Yikun Dou, Rui Wang, Jingtao Xu, Mingyuan Fan, Guibin Chen, Yang Li, et al. Skyreels-a2: Compose anything in video diffusion transformers. arXiv preprint arXiv:2504.02436, 2025

  77. [78]

    Skyreels-a1: Expressive portrait animation in video diffusion transformers

    Di Qiu, Zhengcong Fei, Rui Wang, Jialin Bai, Changqian Yu, Mingyuan Fan, Guibin Chen, and Xiang Wen. Skyreels-a1: Expressive portrait animation in video diffusion transformers. arXiv preprint arXiv:2502.10841, 2025

A SkyReels-Bench scoring guidelines

Detailed scoring guidelines for SkyReels-Bench Evaluation Scoring Guidelines (1-5 scale) Instruction A...