arxiv: 2504.13074 · v3 · submitted 2025-04-17 · 💻 cs.CV

Recognition: 2 theorem links

SkyReels-V2: Infinite-length Film Generative Model

Boyuan Xu, Chengcheng Ma, Chunze Lin, Debang Li, Di Qiu, Dixuan Lin, Guibin Chen, Hao Zhang, Jiangping Yang, Junchen Zhu, Kang Kang, Mingyuan Fan, Nuo Pang, Peng Zhao, Sheng Chen, Weiming Xiong, Wei Wang, Yahui Zhou, Yang Li, Yubing Song, Yupeng Liang, Yuzhe Jin, Zheng Chen, Zhengcong Fei, Zhiheng Xu

Pith reviewed 2026-05-14 20:18 UTC · model grok-4.3

classification 💻 cs.CV

keywords video generation · diffusion models · infinite length video · film generation · multi-modal LLM · reinforcement learning · video captioning · shot-aware generation

The pith

SkyReels-V2 generates infinite-length films by combining language models, reinforcement learning, and diffusion forcing.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Current video generators trade off prompt adherence, visual quality, motion realism, and length, typically limiting outputs to short clips of five to ten seconds. SkyReels-V2 tackles these limits through a structural video representation built from a multi-modal LLM and sub-expert models for shot details, followed by progressive pretraining and a four-stage post-training sequence. The stages include concept-balanced supervised fine-tuning, motion-specific reinforcement learning on annotated and synthetic data, a diffusion forcing framework with non-decreasing noise schedules for long synthesis, and final high-quality fine-tuning. This integrated pipeline is presented as the route to arbitrary-duration film generation that keeps all four qualities in balance.
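The summary does not say how the concept-balanced fine-tuning set is assembled. As a minimal sketch of one possible reading, the Python below picks a concept uniformly at random and then picks a clip within that concept, so that rare concepts are not drowned out by frequent ones. The function name `concept_balanced_sample`, the `(clip_id, concept)` pair layout, and the uniform-over-concepts rule are all assumptions made for illustration, not the authors' procedure.

```python
import random
from collections import defaultdict


def concept_balanced_sample(clips, num_samples, seed=0):
    """Draw clip ids so that every concept label is equally likely to be hit.

    `clips` is a list of (clip_id, concept) pairs. Both this layout and the
    uniform-over-concepts rule are assumptions made for this sketch.
    """
    rng = random.Random(seed)
    by_concept = defaultdict(list)
    for clip_id, concept in clips:
        by_concept[concept].append(clip_id)
    concepts = sorted(by_concept)  # sorted for reproducibility
    samples = []
    for _ in range(num_samples):
        concept = rng.choice(concepts)                   # uniform over concepts
        samples.append(rng.choice(by_concept[concept]))  # uniform within concept
    return samples


# Toy data: 98 clips of a common concept, 2 clips of a rare one.
toy_clips = [(f"w{i}", "walking") for i in range(98)]
toy_clips += [(f"j{i}", "juggling") for i in range(2)]
picked = concept_balanced_sample(toy_clips, num_samples=1000)
rare_share = sum(c.startswith("j") for c in picked) / len(picked)
print(f"share of rare-concept clips: {rare_share:.2f}")
```

In the toy data the rare concept makes up 2% of the clips but roughly half of the draws. Whether the actual pipeline balances this aggressively is not stated in the abstract.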

Core claim

SkyReels-V2 synergizes a Multi-modal Large Language Model, multi-stage pretraining, reinforcement learning, and a diffusion forcing framework to produce an infinite-length film generative model that maintains prompt adherence, visual quality, motion dynamics, and duration without the usual compromises.

What carries the argument

The diffusion forcing framework with non-decreasing noise schedules, which enables long-video synthesis inside an efficient search space while avoiding new artifacts.
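The abstract does not spell out the schedule itself. One way to read "non-decreasing" is that, at any sampling step, a later frame is never less noisy than an earlier one. The sketch below builds such a table with a simple staggered ramp. The function name `non_decreasing_schedule`, the integer noise levels, and the `lag` parameter are hypothetical choices for illustration and are not taken from the paper.

```python
def non_decreasing_schedule(num_frames, num_levels, lag):
    """Return one row of per-frame noise levels for each sampling step.

    Level `num_levels` means pure noise and level 0 means clean. Frame i starts
    denoising `lag` steps after frame i - 1 (an assumed rule), so within each
    row the noise level never decreases as the frame index grows.
    """
    total_steps = num_levels + lag * (num_frames - 1)
    rows = []
    for step in range(total_steps + 1):
        row = [
            min(num_levels, max(0, num_levels - (step - lag * i)))
            for i in range(num_frames)
        ]
        rows.append(row)
    return rows


for row in non_decreasing_schedule(num_frames=4, num_levels=3, lag=1):
    print(row)
```

With four frames, three levels, and a lag of one, the first row is all 3s, the last row is all 0s, and every row in between is sorted in ascending order. Setting `lag=0` collapses the table to the usual case in which all frames share one noise level per step.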

If this is right

Videos can be produced at any length while preserving both resolution and realistic motion.
Cinematic grammar such as shot composition, actor expressions, and camera motions is captured through the unified Video Captioner and expert models.
Motion-specific reinforcement learning reduces dynamic artifacts that appear in extended sequences.
The final high-quality supervised fine-tuning step refines visual fidelity for film-like results across durations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same staged training pattern could be tested on narrative consistency across multi-minute character arcs.
Adding further expert models for elements like lighting or editing might extend the method beyond current shot-level representation.
Deployment on consumer hardware would test whether the efficient search space keeps inference costs practical for long outputs.

Load-bearing premise

The diffusion forcing framework with non-decreasing noise schedules can extend video generation to arbitrary lengths without introducing artifacts or causing quality collapse.

What would settle it

Generate 60-second videos with SkyReels-V2 and measure whether visual quality scores and motion consistency remain comparable to its 5-10 second outputs; sustained degradation would disprove the central claim.
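The comparison described above could be scripted roughly as follows. This is a sketch only: the per-video scores would come from whatever quality or motion-consistency metric the evaluator chooses, the 5% tolerance is an arbitrary illustrative threshold, and the numbers in the usage lines are placeholders rather than measurements.

```python
from statistics import mean


def relative_drop(short_scores, long_scores):
    """Fractional drop in mean score from short clips to long clips."""
    short_avg = mean(short_scores)
    long_avg = mean(long_scores)
    return (short_avg - long_avg) / short_avg


def sustained_degradation(short_scores, long_scores, tolerance=0.05):
    """True when long clips score worse than short clips by more than `tolerance`.

    The default tolerance is an illustrative assumption, not a value from the paper.
    """
    return relative_drop(short_scores, long_scores) > tolerance


# Placeholder scores (higher is better); these are NOT real results.
short_clip_scores = [0.80, 0.82, 0.78]
long_clip_scores = [0.79, 0.81, 0.77]
print(sustained_degradation(short_clip_scores, long_clip_scores))
```

A fuller test would also score each long video in consecutive windows, since a mean over the whole clip can hide quality that decays toward the end.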

Original abstract

Recent advances in video generation have been driven by diffusion models and autoregressive frameworks, yet critical challenges persist in harmonizing prompt adherence, visual quality, motion dynamics, and duration: compromises in motion dynamics to enhance temporal visual quality, constrained video duration (5-10 seconds) to prioritize resolution, and inadequate shot-aware generation stemming from general-purpose MLLMs' inability to interpret cinematic grammar, such as shot composition, actor expressions, and camera motions. These intertwined limitations hinder realistic long-form synthesis and professional film-style generation. To address these limitations, we propose SkyReels-V2, an Infinite-length Film Generative Model, that synergizes Multi-modal Large Language Model (MLLM), Multi-stage Pretraining, Reinforcement Learning, and Diffusion Forcing Framework. Firstly, we design a comprehensive structural representation of video that combines the general descriptions by the Multi-modal LLM and the detailed shot language by sub-expert models. Aided with human annotation, we then train a unified Video Captioner, named SkyCaptioner-V1, to efficiently label the video data. Secondly, we establish progressive-resolution pretraining for the fundamental video generation, followed by a four-stage post-training enhancement: Initial concept-balanced Supervised Fine-Tuning (SFT) improves baseline quality; Motion-specific Reinforcement Learning (RL) training with human-annotated and synthetic distortion data addresses dynamic artifacts; Our diffusion forcing framework with non-decreasing noise schedules enables long-video synthesis in an efficient search space; Final high-quality SFT refines visual fidelity. All the code and models are available at https://github.com/SkyworkAI/SkyReels-V2.
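To make the "structural representation" idea concrete, here is a small sketch of a record that pairs a general description with shot-language fields. The class name `StructuredCaption`, the three shot-language fields (mirroring the shot composition, actor expressions, and camera motions named in the abstract), and the `"unknown"` fallback are assumptions for illustration. The actual SkyCaptioner-V1 output format is not described in the abstract.

```python
from dataclasses import dataclass, asdict


@dataclass
class StructuredCaption:
    description: str       # general description, e.g. from a multi-modal LLM
    shot_type: str         # shot composition label from a sub-expert model
    actor_expression: str  # expression label from a sub-expert model
    camera_motion: str     # camera motion label from a sub-expert model


def build_caption(general_description, expert_labels):
    """Merge one general description with per-field expert labels.

    Fields missing from `expert_labels` fall back to "unknown" (an assumed default).
    """
    return StructuredCaption(
        description=general_description.strip(),
        shot_type=expert_labels.get("shot_type", "unknown"),
        actor_expression=expert_labels.get("actor_expression", "unknown"),
        camera_motion=expert_labels.get("camera_motion", "unknown"),
    )


caption = build_caption(
    "A woman walks along a beach at sunset.",
    {"shot_type": "wide shot", "camera_motion": "tracking"},
)
print(asdict(caption))
```

Keeping the shot-language fields separate from the free-text description is what would let a downstream model be conditioned on, or filtered by, a single cinematic attribute.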

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

SkyReels-V2 describes a four-stage post-training pipeline plus a new captioner for longer video generation, but the abstract supplies no metrics or ablations to support the performance claims.

SkyReels-V2 puts forward a concrete four-stage post-training recipe on top of a custom video captioner to tackle prompt adherence, motion quality, and length in one system. The abstract lays out the pieces in order: progressive pretraining, concept-balanced SFT, motion RL with human and synthetic data, diffusion forcing with non-decreasing noise, and final high-quality SFT. That ordering is straightforward and matches the problems the authors list.

Referee Report

2 major / 1 minor

Summary. The manuscript presents SkyReels-V2, an infinite-length film generative model that integrates a Multi-modal Large Language Model (MLLM) for structural video representation and captioning via SkyCaptioner-V1, multi-stage pretraining with progressive resolution, reinforcement learning for motion-specific refinement using human-annotated and synthetic data, and a diffusion forcing framework with non-decreasing noise schedules to achieve harmonized prompt adherence, visual quality, motion dynamics, and extended duration beyond the typical 5-10 second limit of prior models.

Significance. If the empirical claims hold, the work could advance long-form video generation toward professional film-style output by addressing intertwined limitations in duration, motion quality, and cinematic control that current diffusion and autoregressive approaches face.

major comments (2)

[Abstract] The central claims of superior prompt adherence, visual quality, motion dynamics, and infinite-length synthesis are presented without any quantitative metrics, ablation studies, or direct comparisons to baselines, leaving the effectiveness of the four-stage post-training pipeline and the diffusion forcing framework unsupported by visible evidence.
[Abstract] The diffusion forcing framework with non-decreasing noise schedules is asserted to enable long-video synthesis in an efficient search space without new artifacts or progressive quality collapse, yet no length-scaling experiments, temporal consistency metrics (e.g., as duration grows from tens of seconds to minutes), or comparisons against standard decreasing-noise baselines are reported.

minor comments (1)

[Abstract] The description of 'shot language' and 'cinematic grammar' (shot composition, actor expressions, camera motions) would benefit from explicit definitions or examples to clarify how sub-expert models differ from general-purpose MLLMs.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the abstract. We address the two major comments point by point below. We agree that the abstract would be strengthened by incorporating key quantitative highlights and will revise it in the next version while preserving its concise nature.

Referee: [Abstract] The central claims of superior prompt adherence, visual quality, motion dynamics, and infinite-length synthesis are presented without any quantitative metrics, ablation studies, or direct comparisons to baselines, leaving the effectiveness of the four-stage post-training pipeline and the diffusion forcing framework unsupported by visible evidence.

Authors: We acknowledge that the abstract emphasizes the methodological pipeline without numerical results. The full manuscript (Sections 4 and 5) reports quantitative evaluations including CLIP-based prompt adherence scores, FID/FVD for visual quality, motion-specific metrics, ablation studies on each post-training stage, and direct comparisons against baselines such as CogVideoX and other long-video models. To address the concern, we will revise the abstract to include the primary quantitative gains (e.g., relative improvements in long-form consistency and quality metrics) while keeping the length appropriate.

revision: yes

Referee: [Abstract] The diffusion forcing framework with non-decreasing noise schedules is asserted to enable long-video synthesis in an efficient search space without new artifacts or progressive quality collapse, yet no length-scaling experiments, temporal consistency metrics (e.g., as duration grows from tens of seconds to minutes), or comparisons against standard decreasing-noise baselines are reported.

Authors: The manuscript contains length-scaling experiments and temporal consistency analysis (Section 4.3) that evaluate performance from 10-second clips to multi-minute sequences, reporting metrics such as frame-wise consistency and artifact rates under the non-decreasing schedule versus standard decreasing-noise baselines. These results show maintained quality without progressive collapse. We will add a brief summary of the scaling behavior and key consistency metrics to the abstract to make this evidence visible at the outset.

revision: yes
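The simulated rebuttal names "frame-wise consistency" without defining it. A common way to compute such a number, offered here only as an assumed definition, is the mean cosine similarity between feature vectors of consecutive frames. The sketch takes the feature vectors as given; which encoder produces them is left open, and nothing here is taken from the manuscript.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def framewise_consistency(features):
    """Mean cosine similarity between each frame's features and the next frame's."""
    if len(features) < 2:
        raise ValueError("need at least two frames")
    sims = [cosine(a, b) for a, b in zip(features, features[1:])]
    return sum(sims) / len(sims)


# Toy 2-D features: the first two frames agree, the third is orthogonal to them.
print(framewise_consistency([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]))
```

Tracking this value per time window, rather than once per video, is what would reveal the progressive collapse the referee asks about.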

Circularity Check

0 steps flagged

No significant circularity; claims rest on described engineering integration rather than self-referential reduction

The manuscript presents SkyReels-V2 as a composite system integrating MLLM-based captioning, progressive pretraining, RL stages, and a diffusion forcing framework. No equations are supplied that equate any claimed output (e.g., infinite-length coherence) to quantities defined by the same fitted parameters or by construction. The diffusion-forcing description is stated at the architectural level without a mathematical derivation that collapses to the input schedule. No self-citation is invoked as a uniqueness theorem or load-bearing premise, and no ansatz is smuggled via prior work. The central claims therefore remain additive engineering assertions rather than tautological re-statements of inputs.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the effectiveness of the diffusion forcing stage and the assumption that human-annotated plus synthetic distortion data suffice to train motion quality via RL. No new physical entities are postulated.

free parameters (1)

non-decreasing noise schedule parameters
Chosen to enable long-sequence generation; specific values not detailed in abstract.

axioms (1)

domain assumption: General-purpose MLLMs require sub-expert models to interpret cinematic grammar such as shot composition and camera motions.
Invoked to justify the structural video representation and SkyCaptioner-V1 training.

pith-pipeline@v0.9.0 · 5679 in / 1259 out tokens · 25687 ms · 2026-05-14T20:18:50.901616+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Foundation.DimensionForcing alexander_duality_circle_linking unclear

unclear
Relation between the paper passage and the cited Recognition theorem.

Our diffusion forcing framework with non-decreasing noise schedules enables long-video synthesis in an efficient search space

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 25 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

AnyFlow: Any-Step Video Diffusion Model with On-Policy Flow Map Distillation
cs.CV 2026-05 unverdicted novelty 8.0

AnyFlow enables any-step video diffusion by distilling flow-map transitions over arbitrary time intervals with on-policy backward simulation.
Flow-GRPO: Training Flow Matching Models via Online RL
cs.CV 2025-05 unverdicted novelty 8.0

Flow-GRPO is the first online RL method for flow matching models, raising GenEval accuracy from 63% to 95% and text-rendering accuracy from 59% to 92% with little reward hacking.
FreeSpec: Training-Free Long Video Generation via Singular-Spectrum Reconstruction
cs.CV 2026-05 unverdicted novelty 7.0

FreeSpec uses SVD-based spectral reconstruction to fuse global low-rank and local high-rank features, reducing content drift and preserving temporal dynamics in long video generation.
Stream-R1: Reliability-Perplexity Aware Reward Distillation for Streaming Video Generation
cs.CV 2026-05 unverdicted novelty 7.0

Stream-R1 improves distillation of autoregressive streaming video diffusion models by adaptively weighting supervision with a reward model at both rollout and per-pixel levels.
AniMatrix: An Anime Video Generation Model that Thinks in Art, Not Physics
cs.CV 2026-05 unverdicted novelty 7.0

AniMatrix generates anime videos using a production knowledge taxonomy, dual-channel conditioning, style-motion curriculum, and deformation-aware preference optimization, outperforming baselines in animator evaluation...
AniMatrix: An Anime Video Generation Model that Thinks in Art, Not Physics
cs.CV 2026-05 unverdicted novelty 7.0

AniMatrix generates anime videos by structuring artistic production rules into a controllable taxonomy and training the model to prioritize those rules over physical realism, achieving top scores from professional ani...
AniMatrix: An Anime Video Generation Model that Thinks in Art, Not Physics
cs.CV 2026-05 unverdicted novelty 7.0

AniMatrix generates anime videos using a structured taxonomy of artistic production variables, dual-channel conditioning, a style-motion curriculum, and deformation-aware optimization to prioritize art over physics.
Sparse Forcing: Native Trainable Sparse Attention for Real-time Autoregressive Diffusion Video Generation
cs.CV 2026-04 unverdicted novelty 7.0

Sparse Forcing adds a native trainable sparsity mechanism and PBSA kernel to autoregressive diffusion video models, yielding higher VBench scores and 1.1-1.27x speedups on 5s to 1min generations.
Efficient Video Diffusion Models: Advancements and Challenges
cs.CV 2026-04 unverdicted novelty 7.0

A survey that groups efficient video diffusion methods into four paradigms—step distillation, efficient attention, model compression, and cache/trajectory optimization—and outlines open challenges for practical use.
Grounded Forcing: Bridging Time-Independent Semantics and Proximal Dynamics in Autoregressive Video Synthesis
cs.CV 2026-04 unverdicted novelty 7.0

Grounded Forcing introduces dual memory caching, reference-based positional embeddings, and proximity-weighted recaching to bridge stable semantics with local dynamics, improving long-range consistency in autoregressi...
Not All Frames Deserve Full Computation: Accelerating Autoregressive Video Generation via Selective Computation and Predictive Extrapolation
cs.CV 2026-04 conditional novelty 7.0

SCOPE accelerates autoregressive video diffusion up to 4.73x by using a tri-modal cache-predict-recompute scheduler with Taylor extrapolation and selective active-frame computation while preserving output quality.
Head Forcing: Long Autoregressive Video Generation via Head Heterogeneity
cs.CV 2026-05 unverdicted novelty 6.0

Head Forcing assigns tailored KV cache strategies to local, anchor, and memory attention heads plus head-wise RoPE re-encoding to extend autoregressive video generation from seconds to minutes without training.
Delta Forcing: Trust Region Steering for Interactive Autoregressive Video Generation
cs.CV 2026-05 unverdicted novelty 6.0

Delta Forcing uses latent trajectory deltas to adaptively limit unreliable teacher guidance while enforcing monotonic continuity, improving temporal consistency in interactive autoregressive video generation.
Forcing-KV: Hybrid KV Cache Compression for Efficient Autoregressive Video Diffusion Models
cs.CV 2026-05 unverdicted novelty 6.0

Forcing-KV applies head-specific static and dynamic pruning to KV caches in AR video diffusion models, achieving over 29 fps, 30% memory reduction, and up to 2.82x speedup at maintained quality.
Stream-T1: Test-Time Scaling for Streaming Video Generation
cs.CV 2026-05 unverdicted novelty 6.0

Stream-T1 is a test-time scaling framework for streaming video generation using scaled noise propagation from history, reward pruning across short and long windows, and feedback-guided memory sinking to improve tempor...
Mutual Forcing: Dual-Mode Self-Evolution for Fast Autoregressive Audio-Video Character Generation
cs.CV 2026-04 unverdicted novelty 6.0

Mutual Forcing trains a single native autoregressive audio-video model with mutually reinforcing few-step and multi-step modes via self-distillation to match 50-step baselines at 4-8 steps.
Building a Precise Video Language with Human-AI Oversight
cs.CV 2026-04 unverdicted novelty 6.0

CHAI framework pairs AI pre-captions with expert human critiques to produce precise video descriptions, enabling open models to outperform closed ones like Gemini-3.1-Pro and improve fine-grained control in video gene...
TS-Attn: Temporal-wise Separable Attention for Multi-Event Video Generation
cs.CV 2026-04 unverdicted novelty 6.0

TS-Attn dynamically separates and rearranges attention in existing text-to-video models to improve temporal consistency and prompt adherence for videos with multiple sequential actions.
Memorize When Needed: Decoupled Memory Control for Spatially Consistent Long-Horizon Video Generation
cs.CV 2026-04 unverdicted novelty 6.0

A decoupled memory branch with hybrid cues, cross-attention, and gating improves spatial consistency and data efficiency in long-horizon camera-trajectory video generation.
Repurposing 3D Generative Model for Autoregressive Layout Generation
cs.CV 2026-04 unverdicted novelty 6.0

LaviGen turns 3D generative models into an autoregressive layout generator that models geometric and physical constraints, delivering 19% higher physical plausibility and 65% faster inference on the LayoutVLM benchmark.
Long-Horizon Streaming Video Generation via Hybrid Attention with Decoupled Distillation
cs.CV 2026-04 conditional novelty 6.0

Hybrid Forcing combines linear temporal attention for long-range retention, block-sparse attention for efficiency, and decoupled distillation to achieve real-time unbounded 832x480 streaming video generation at 29.5 FPS.
Salt: Self-Consistent Distribution Matching with Cache-Aware Training for Fast Video Generation
cs.CV 2026-04 unverdicted novelty 6.0

Salt improves low-step video generation quality by adding endpoint-consistent regularization to distribution matching distillation and using cache-conditioned feature alignment for autoregressive models.
Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion
cs.CV 2025-06 unverdicted novelty 6.0

Self Forcing trains autoregressive video diffusion models by performing autoregressive rollout with KV caching during training to close the exposure bias gap, using a holistic video-level loss and few-step diffusion f...
Motion-Aware Caching for Efficient Autoregressive Video Generation
cs.CV 2026-05 unverdicted novelty 5.0

MotionCache speeds up autoregressive video generation by 6.28x on SkyReels-V2 and 1.64x on MAGI-1 via motion-weighted cache reuse based on inter-frame differences, with negligible quality loss on VBench.
Motif-Video 2B: Technical Report
cs.CV 2026-04 unverdicted novelty 5.0

Motif-Video 2B achieves 83.76% VBench score, beating a 14B-parameter baseline with 7x fewer parameters and substantially less training data through shared cross-attention and a three-part backbone.

Reference graph

Works this paper leans on

77 extracted references · 77 canonical work pages · cited by 23 Pith papers · 11 internal anchors

[1]

Video generation models as world simulators, 2024

OpenAI. Video generation models as world simulators, 2024

work page 2024
[2]

Kling, 2024

Kuaishou. Kling, 2024

work page 2024
[3]

Hailuo, 2024

MiniMax. Hailuo, 2024

work page 2024
[4]

Veo 2, 2024

DeepMind. Veo 2, 2024

work page 2024
[5]

Wan: Open and Advanced Large-Scale Video Generative Models

WanTeam, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, Jianyuan Zeng, Jiayu Wang, Jingfeng Zhang, Jingren Zhou, Jinkai Wang, Jixuan Chen, Kai Zhu, Kang Zhao, Keyu Yan, Lianghua Huang, Mengyang Feng, Ningyi Zhang, Pandeng Li, Pingyu Wu, Ruihang Chu, Ruili Feng, Shiwei Zhang, Siyang Sun, Tao Fang, Ti...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[6]

VBench: Comprehensive benchmark suite for video generative models

Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, Yaohui Wang, Xinyuan Chen, Limin Wang, Dahua Lin, Yu Qiao, and Ziwei Liu. VBench: Comprehensive benchmark suite for video generative models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognit...

work page 2024
[7]

Videojam: Joint appearance-motion representations for enhanced motion generation in video models, 2025

Hila Chefer, Uriel Singer, Amit Zohar, Yuval Kirstain, Adam Polyak, Yaniv Taigman, Lior Wolf, and Shelly Sheynin. Videojam: Joint appearance-motion representations for enhanced motion generation in video models, 2025

work page 2025
[8]

Qwen2.5-VL Technical Report

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-vl technical report. a...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[9]

Gpt-o1, 2024

OpenAI. Gpt-o1, 2024

work page 2024
[10]

Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning, 2025

DeepSeek-AI. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning, 2025

work page 2025
[11]

Ar-diffusion: Asynchronous video generation with auto-regressive diffusion

Mingzhen Sun, Weining Wang, Gen Li, Jiawei Liu, Jiahui Sun, Wanquan Feng, Shanshan Lao, SiYu Zhou, Qian He, and Jing Liu. Ar-diffusion: Asynchronous video generation with auto-regressive diffusion. arXiv preprint arXiv:2503.07418, 2025

work page arXiv 2025
[12]

Gen-4, 2024

RunwayML. Gen-4, 2024. 22

work page 2024
[13]

Make-A-Video: Text-to-Video Generation without Text-Video Data

Uriel Singer, Adam Polyak, Thomas Hayes, Xi Yin, Jie An, Songyang Zhang, Qiyuan Hu, Harry Yang, Oron Ashual, Oran Gafni, et al. Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:2209.14792, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[14]

Animatediff: Animate your personalized text-to-image diffusion models without specific tuning, 2023

Yuwei Guo, Ceyuan Yang, Anyi Rao, Zhengyang Liang, Yaohui Wang, Yu Qiao, Maneesh Agrawala, Dahua Lin, and Bo Dai. Animatediff: Animate your personalized text-to-image diffusion models without specific tuning, 2023

work page 2023
[15]

Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram V oleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[16]

Video diffusion models

Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J Fleet. Video diffusion models. Advances in Neural Information Processing Systems, 35:8633–8646, 2022

work page 2022
[17]

CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. Cogvideox: Text-to-video diffusion models with an expert transformer. arXiv preprint arXiv:2408.06072, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[18]

Hunyuanvideo: A systematic framework for large video generative models, 2025

Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, Kathrina Wu, Qin Lin, Junkun Yuan, Yanxin Long, Aladdin Wang, Andong Wang, Changlin Li, Duojun Huang, Fang Yang, Hao Tan, Hongmei Wang, Jacob Song, Jiawang Bai, Jianbing Wu, Jinbao Xue, Joey Wang, Kai Wang, Mengyang Liu, Pengyu Li, Shuai Li, ...

work page 2025
[19]

Step-video-t2v technical report: The practice, challenges, and future of video foundation model, 2025

Guoqing Ma, Haoyang Huang, Kun Yan, Liangyu Chen, Nan Duan, Shengming Yin, Changyi Wan, Ranchen Ming, Xiaoniu Song, Xing Chen, Yu Zhou, Deshan Sun, Deyu Zhou, Jian Zhou, Kaijun Tan, Kang An, Mei Chen, Wei Ji, Qiling Wu, Wen Sun, Xin Han, Yanan Wei, Zheng Ge, Aojie Li, Bin Wang, Bizhu Huang, Bo Wang, Brian Li, Changxing Miao, Chen Xu, Chenfei Wu, Chenguang...

work page 2025
[20]

Skyreels-v1, 2025

SkyworkAI. Skyreels-v1, 2025

work page 2025
[22]

U-net: Convolutional networks for biomedical image segmentation

Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In Medical image computing and computer-assisted intervention–MICCAI 2015: 18th international conference, Munich, Germany, October 5-9, 2015, proceedings, part III 18, pages 234–241. Springer, 2015

work page 2015
[23]

Scalable Diffusion Models with Transformers

William Peebles and Saining Xie. Scalable diffusion models with transformers. arXiv preprint arXiv:2212.09748, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[24]

Scaling rectified flow transformers for high-resolution image synthesis

Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In Forty-first international conference on machine learning, 2024

work page 2024
[25]

Taming transformers for high-resolution image synthesis

Patrick Esser, Robin Rombach, and Bjorn Ommer. Taming transformers for high-resolution image synthesis. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12873–12883, 2021

work page 2021
[26]

Open-Sora: Democratizing Efficient Video Production for All

Zangwei Zheng, Xiangyu Peng, Tianji Yang, Chenhui Shen, Shenggui Li, Hongxin Liu, Yukun Zhou, Tianyi Li, and Yang You. Open-sora: Democratizing efficient video production for all. arXiv preprint arXiv:2412.20404, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[27]

Wf-vae: Enhancing video vae by wavelet-driven energy flow for latent video diffusion model

Zongjian Li, Bin Lin, Yang Ye, Liuhan Chen, Xinhua Cheng, Shenghai Yuan, and Li Yuan. Wf-vae: Enhancing video vae by wavelet-driven energy flow for latent video diffusion model. arXiv preprint arXiv:2411.17459, 2024. 23

work page arXiv 2024
[28]

Cv-vae: A compatible video vae for latent generative video models

Sijie Zhao, Yong Zhang, Xiaodong Cun, Shaoshu Yang, Muyao Niu, Xiaoyu Li, Wenbo Hu, and Ying Shan. Cv-vae: A compatible video vae for latent generative video models. https://arxiv.org/abs/2405.20279, 2024

work page arXiv 2024
[29]

NVIDIA et. al. Cosmos world foundation model platform for physical ai. arXiv preprint arXiv:2501.03575, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[30]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PmLR, 2021

work page 2021
[31]

Exploring the limits of transfer learning with a unified text-to-text transformer

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of machine learning research, 21(140):1–67, 2020

work page 2020
[32]

Available: https://arxiv.org/abs/2304.09151

Hyung Won Chung, Noah Constant, Xavier Garcia, Adam Roberts, Yi Tay, Sharan Narang, and Orhan Firat. Unimax: Fairer and more effective language sampling for large-scale multilingual pretraining. arXiv preprint arXiv:2304.09151, 2023

work page arXiv 2023
[33]

Denoising diffusion probabilistic models

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020

work page 2020
[34]

High-resolution image synthesis with latent diffusion models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10684–10695, June 2022

work page 2022
[35]

Flow Matching for Generative Modeling

Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[36]

Gpt-4o, 2024

OpenAI. Gpt-4o, 2024

work page 2024
[37]

Gemini 2.5 pro, 2025

DeepMind. Gemini 2.5 pro, 2025

work page 2025
[38]

Tarsier2: Advancing large vision-language models from detailed video description to comprehensive video understanding, 2025

Liping Yuan, Jiawei Wang, Haomiao Sun, Yuchen Zhang, and Yuan Lin. Tarsier2: Advancing large vision-language models from detailed video description to comprehensive video understanding, 2025

[39]

Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedback, 2022

[40]

Xue Bin Peng, Aviral Kumar, Grace Zhang, and Sergey Levine. Advantage-weighted regression: Simple and scalable off-policy reinforcement learning. 2021

[41]

Tao Liu, Huafeng Kuang, and Xianming Lin. Aligning text-to-image diffusion models without human feedback. In ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

[42]

Bram Wallace, Meihua Dang, Rafael Rafailov, Linqi Zhou, Aaron Lou, Senthil Purushwalkam, Stefano Ermon, Caiming Xiong, Shafiq Joty, and Nikhil Naik. Diffusion model alignment using direct preference optimization. IEEE, 2023

[43]

Kai Yang, Jian Tao, Jiafei Lyu, Chunjiang Ge, Jiaxin Chen, Qimai Li, Weihan Shen, Xiaolong Zhu, and Xiu Li. Using human feedback to fine-tune diffusion models without any reward model. IEEE, 2023

[44]

Zhanhao Liang, Yuhui Yuan, Shuyang Gu, Bohan Chen, Tiankai Hang, Ji Li, and Liang Zheng. Step-aware preference optimization: Aligning preference with denoising performance at each step. 2024

[45]

Runtao Liu, Haoyu Wu, Zheng Ziqiang, Chen Wei, Yingqing He, Renjie Pi, and Qifeng Chen. Videodpo: Omni-preference alignment for video diffusion generation, 2024

[46]

Jie Liu, Gongye Liu, Jiajun Liang, Ziyang Yuan, Xiaokun Liu, Mingwu Zheng, Xiele Wu, Qiulin Wang, Wenyu Qin, Menghan Xia, Xintao Wang, Xiaohong Liu, Fei Yang, Pengfei Wan, Di Zhang, Kun Gai, Yujiu Yang, and Wanli Ouyang. Improving video generation with human feedback. arXiv preprint arXiv:2501.13918, 2025

[47]

Junjie Ke, Qifei Wang, Yilin Wang, Peyman Milanfar, and Feng Yang. Musiq: Multi-scale image quality transformer. 2021

[48]

Yibin Wang, Yuhang Zang, Hao Li, Cheng Jin, and Jiaqi Wang. Unified reward model for multimodal understanding and generation, 2025

[49]

Haibo Tong, Zhaoyang Wang, Zhaorun Chen, Haonian Ji, Shi Qiu, Siwei Han, Kexin Geng, Zhongkai Xue, Yiyang Zhou, Peng Xia, Mingyu Ding, Rafael Rafailov, Chelsea Finn, and Huaxiu Yao. Mj-video: Fine-grained benchmarking and rewarding video preferences in video generation, 2025

[50]

Boyuan Chen, Diego Marti Monso, Yilun Du, Max Simchowitz, Russ Tedrake, and Vincent Sitzmann. Diffusion forcing: Next-token prediction meets full-sequence diffusion, 2024

[51]

Mingzhen Sun, Weining Wang, Gen Li, Jiawei Liu, Jiahui Sun, Wanquan Feng, Shanshan Lao, SiYu Zhou, Qian He, and Jing Liu. Ar-diffusion: Asynchronous video generation with auto-regressive diffusion, 2025

[52]

Kiwhan Song, Boyuan Chen, Max Simchowitz, Yilun Du, Russ Tedrake, and Vincent Sitzmann. History-guided video diffusion, 2025

[53]

Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance, 2022

[54]

Tianwei Yin, Qiang Zhang, Richard Zhang, William T. Freeman, Fredo Durand, Eli Shechtman, and Xun Huang. From slow bidirectional to fast autoregressive video diffusion models, 2025

[55]

Tianwei Yin, Michaël Gharbi, Richard Zhang, Eli Shechtman, Fredo Durand, William T. Freeman, and Taesung Park. One-step diffusion with distribution matching distillation, 2024

[56]

Tianwei Yin, Michaël Gharbi, Taesung Park, Richard Zhang, Eli Shechtman, Fredo Durand, and William T. Freeman. Improved distribution matching distillation for fast image synthesis, 2024

[57]

Yuwei Guo, Ceyuan Yang, Ziyan Yang, Zhibei Ma, Zhijie Lin, Zhenheng Yang, Dahua Lin, and Lu Jiang. Long context tuning for video generation, 2025

[58]

Qiuheng Wang, Yukai Shi, Jiarong Ou, Rui Chen, Ke Lin, Jiahao Wang, Boyuan Jiang, Haotian Yang, Mingwu Zheng, Xin Tao, Fei Yang, Pengfei Wan, and Di Zhang. Koala-36m: A large-scale video dataset improving consistency between fine-grained conditions and video content, 2024

[59]

Zhenzhi Wang, Yixuan Li, Yanhong Zeng, Youqing Fang, Yuwei Guo, Wenran Liu, Jing Tan, Kai Chen, Tianfan Xue, Bo Dai, and Dahua Lin. Humanvid: Demystifying training data for camera-controllable human image animation. In NeurIPS, 2024

[60]

Tomáš Souček and Jakub Lokoč. Transnet v2: An effective deep network architecture for fast shot transition detection. arXiv preprint arXiv:2008.04838, 2020

[61]

Verb. aesthetic-predictor-v2-5. https://github.com/discus0434/aesthetic-predictor-v2-5, 2024. Accessed: 2024.11.12

[62]

Ed Pizzi, Sreya Dutta Roy, Sugosh Nagavara Ravindra, Priya Goyal, and Matthijs Douze. A self-supervised descriptor for image copy detection. Proc. CVPR, 2022

[63]

Haoning Wu, Chaofeng Chen, Jingwen Hou, Liang Liao, Annan Wang, Wenxiu Sun, Qiong Yan, and Weisi Lin. Fast-vqa: Efficient end-to-end video quality assessment with fragment sampling. In Proceedings of European Conference of Computer Vision (ECCV), 2022

[64]

Haoning Wu, Erli Zhang, Liang Liao, Chaofeng Chen, Jingwen Hou, Annan Wang, Wenxiu Sun, Qiong Yan, and Weisi Lin. Exploring video quality assessment on user generated contents from aesthetic and technical perspectives. In International Conference on Computer Vision (ICCV), 2023

[65]

Haoning Wu. Open source deep end-to-end video quality assessment toolbox, 2022

[66]

Lorenzo Agnolucci, Leonardo Galteri, Marco Bertini, and Alberto Del Bimbo. Arniqa: Learning distortion manifold for image quality assessment. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 189–198, 2024

[67]

Youngmin Baek, Bado Lee, Dongyoon Han, Sangdoo Yun, and Hwalsuk Lee. Character region awareness for text detection. In Proceedings of the IEEE/CVF conference on Computer Vision and Pattern Recognition (CVPR), pages 9365–9374, 2019

[68]

Yuan Yao, Tianyu Yu, Ao Zhang, Chongyi Wang, Junbo Cui, Hongji Zhu, Tianchi Cai, Haoyu Li, Weilin Zhao, Zhihui He, et al. Minicpm-v: A gpt-4v level mllm on your phone. arXiv preprint arXiv:2408.01800, 2024

[69]

Yin Chen, Jia Li, Shiguang Shan, Meng Wang, and Richang Hong. From static to dynamic: Adapting landmark-aware image models for facial expression recognition in videos. IEEE Transactions on Affective Computing, pages 1–15, 2024

[70]

Feng Liu, Shiwei Zhang, Xiaofeng Wang, Yujie Wei, Haonan Qiu, Yuzhong Zhao, Yingya Zhang, Qixiang Ye, and Fang Wan. Timestep embedding tells: It’s time to cache for video diffusion model. arXiv preprint arXiv:2411.19108, 2024

[71]

Pejaver V Rao and Lawrence L Kupper. Ties in paired-comparison experiments: A generalization of the bradley-terry model. Journal of the American Statistical Association, 62(317):194–204, 1967

[72]

Sam Ade Jacobs, Masahiro Tanaka, Chengming Zhang, Minjia Zhang, Shuaiwen Leon Song, Samyam Rajbhandari, and Yuxiong He. Deepspeed ulysses: System optimizations for enabling training of extreme long sequence transformer models, 2023

[73]

Jintao Zhang, Haofeng Huang, Pengle Zhang, Jia Wei, Jun Zhu, and Jianfei Chen. Sageattention2: Efficient attention with thorough outlier smoothing and per-thread int4 quantization, 2025

[74]

RunwayML. Gen-3 alpha, 2025

[75]

Xiangyu Peng, Zangwei Zheng, Chenhui Shen, Tom Young, Xinying Guo, Binluo Wang, Hang Xu, Hongxin Liu, Mingyan Jiang, Wenjun Li, Yuhui Wang, Anbang Ye, Gang Ren, Qianran Ma, Wanying Liang, Xiang Lian, Xiwen Wu, Yuting Zhong, Zhuangyan Li, Chaoyu Gong, Guojun Lei, Leijun Cheng, Limin Zhang, Minghao Li, Ruijie Zhang, Silan Hu, Shijie Huang, Xiaokang Wang, Yu...

[76]

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer, 2019

[77]

Zhengcong Fei, Debang Li, Di Qiu, Jiahua Wang, Yikun Dou, Rui Wang, Jingtao Xu, Mingyuan Fan, Guibin Chen, Yang Li, et al. Skyreels-a2: Compose anything in video diffusion transformers. arXiv preprint arXiv:2504.02436, 2025

[78]

Di Qiu, Zhengcong Fei, Rui Wang, Jialin Bai, Changqian Yu, Mingyuan Fan, Guibin Chen, and Xiang Wen. Skyreels-a1: Expressive portrait animation in video diffusion transformers. arXiv preprint arXiv:2502.10841, 2025

A SkyReels-Bench scoring guidelines

Detailed scoring guidelines for SkyReels-Bench

Evaluation Scoring Guidelines (1-5 scale) Instruction A...
