SpecLoR: Spectral Lookahead Rectification for Motion-Coherent Text-to-Video Generation

Bohan Wang; Ruijie Quan; Xu Zhang; Yi Yang; Yu Lu; Zhaozheng Chen

arxiv: 2606.11969 · v1 · pith:LC5BI77Qnew · submitted 2026-06-10 · 💻 cs.CV

Xu Zhang, Yu Lu, Ruijie Quan, Zhaozheng Chen, Bohan Wang, Yi Yang

Pith reviewed 2026-06-27 09:58 UTC · model grok-4.3

classification 💻 cs.CV

keywords text-to-video generation · flow matching · spectral rectification · motion coherence · lookahead prediction · frequency domain · latent ODE · artifact reduction

The pith

SpecLoR corrects drifted sampling trajectories in text-to-video generation by rectifying the amplitude spectrum of lookahead clean latents to match natural video priors.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Flow-matching text-to-video models accumulate velocity and discretization errors that push sampling trajectories away from natural video statistics, producing inconsistent motion and physical artifacts. SpecLoR intervenes during early sampling by predicting a clean latent, computing its three-dimensional spatiotemporal spectrum, and adjusting only the amplitude values to match the statistics of real videos while leaving phase information unchanged. The corrected latent is then re-noised and the ODE integration continues. This frequency-domain step requires four extra network evaluations and operates without retraining or direct spatial edits. A reader would care because the approach offers a lightweight, plug-in way to enforce motion coherence at inference time.

Core claim

SpecLoR performs lookahead to estimate the clean latent z_{t,0} early in sampling, computes its 3D spatiotemporal spectrum, rectifies the amplitude to match the prior of natural videos while leaving phase intact, and re-noises the corrected state to resume ODE integration, reducing physical artifacts and enhancing motion coherence.
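The three steps in the core claim can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the rectified-flow convention z_t = (1 - t) z_0 + t ε with velocity ε - z_0, a single-channel (time, height, width) latent, a hypothetical power-law prior (`radial_prior`), and a linear blend with strength `lam`. None of the function names, the prior shape, or the blending rule comes from the paper.

```python
import numpy as np

def radial_prior(shape, alpha=1.0):
    """Hypothetical power-law amplitude prior 1 / (1 + |k|)^alpha on the 3D FFT grid."""
    grids = np.meshgrid(*[np.fft.fftfreq(n) * n for n in shape], indexing="ij")
    radius = np.sqrt(sum(g ** 2 for g in grids))
    return 1.0 / (1.0 + radius) ** alpha

def rectify_amplitude(z0_hat, prior_amp, lam=0.5):
    """Blend the amplitude spectrum toward the prior and leave the phase untouched."""
    spec = np.fft.fftn(z0_hat)
    amp, phase = np.abs(spec), np.angle(spec)
    # Assumption: rescale the prior so it carries the same spectral energy as the estimate.
    target = prior_amp * (np.linalg.norm(amp) / np.linalg.norm(prior_amp))
    new_amp = (1.0 - lam) * amp + lam * target
    # The prior is symmetric in k, so the result is real up to rounding error.
    return np.fft.ifftn(new_amp * np.exp(1j * phase)).real

def lookahead_rectify_step(z_t, t, velocity_fn, prior_amp, rng, lam=0.5):
    """Lookahead, amplitude rectification, then re-noising back to time t."""
    z0_hat = z_t - t * velocity_fn(z_t, t)            # lookahead clean-latent estimate
    z0_fix = rectify_amplitude(z0_hat, prior_amp, lam)
    eps = rng.standard_normal(z_t.shape)              # assumption: fresh Gaussian noise
    return (1.0 - t) * z0_fix + t * eps

# Toy usage with an oracle velocity field standing in for the network.
rng = np.random.default_rng(0)
z0 = rng.standard_normal((8, 16, 16))
t = 0.8
z_t = (1.0 - t) * z0 + t * rng.standard_normal(z0.shape)
oracle_velocity = lambda z, s: (z - z0) / s
z_t_new = lookahead_rectify_step(z_t, t, oracle_velocity, radial_prior(z0.shape), rng)
print(z_t_new.shape)
```

With `lam=0.0` the step reduces to plain re-noising of the lookahead estimate, which makes it easy to isolate the effect of the amplitude correction in an ablation.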

What carries the argument

Spectral Lookahead Rectification, which shifts correction to the frequency domain by matching amplitude spectra of early clean latent estimates to natural video priors while preserving phase.

If this is right

Sampling trajectories stay closer to the manifold of natural videos.
Physical artifacts such as inconsistent object trajectories decrease across benchmarks.
Motion coherence improves while adding only four neural function evaluations.
Corrections avoid direct spatial edits that would risk local geometry and incur high cost.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same amplitude-matching step could be tested on image or 3D generative models that use flow or diffusion sampling.
Adaptive choice of when to apply the lookahead step might further reduce overhead on longer sequences.
Frequency-domain priors may prove more stable than spatial priors when videos contain complex camera motion.
The method leaves open whether leaving the phase untouched is always sufficient or whether limited phase adjustments would help in some cases.

Load-bearing premise

The three-dimensional spatiotemporal amplitude spectrum of natural videos supplies a universal, timestep-independent prior that amplitude rectification alone can safely enforce.
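One way to probe this premise is to estimate the prior empirically and inspect how stable it is across clips. The sketch below is an assumption-laden illustration: the function names are hypothetical, clips are assumed to share one (time, height, width) shape, and pooling temporal and spatial frequencies into a single radius is a simplification the paper does not necessarily make.

```python
import numpy as np

def mean_amplitude_spectrum(clips):
    """Average the 3D FFT amplitude over equally shaped (T, H, W) clips."""
    total = None
    for clip in clips:
        amp = np.abs(np.fft.fftn(np.asarray(clip, dtype=np.float64)))
        total = amp if total is None else total + amp
    return total / len(clips)

def radial_profile(amp):
    """Collapse a 3D amplitude spectrum to a 1D profile over integer frequency radius."""
    grids = np.meshgrid(*[np.fft.fftfreq(n) * n for n in amp.shape], indexing="ij")
    radius = np.rint(np.sqrt(sum(g ** 2 for g in grids))).astype(int).ravel()
    sums = np.bincount(radius, weights=amp.ravel())
    counts = np.bincount(radius)
    return sums / np.maximum(counts, 1)  # guard against empty radius bins

# Toy usage on random clips; real use would pass encoded natural-video latents.
rng = np.random.default_rng(0)
clips = [rng.standard_normal((4, 8, 8)) for _ in range(5)]
profile = radial_profile(mean_amplitude_spectrum(clips))
print(profile.shape)
```

Comparing profiles computed on disjoint subsets of clips gives a rough sense of whether a single fixed prior is a reasonable summary.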

What would settle it

Generate matched sets of videos with and without SpecLoR on identical prompts and measure whether physical artifact counts or motion coherence scores differ by a statistically detectable margin.
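A paired comparison of this kind can be analyzed with an exact sign-flip permutation test, sketched below using only the standard library. The function name and the example scores are hypothetical placeholders, and the choice of test is an editorial assumption rather than the paper's evaluation protocol. The enumeration is exponential in the number of pairs, so larger studies would sample sign patterns instead.

```python
from itertools import product

def paired_sign_flip_test(baseline, treated):
    """Exact two-sided sign-flip permutation test on paired per-prompt scores.

    Returns (mean difference, p-value). Enumerates all 2^n sign patterns,
    so it is only practical for small n.
    """
    diffs = [b - a for a, b in zip(baseline, treated)]
    observed = abs(sum(diffs))
    hits = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        stat = abs(sum(s * d for s, d in zip(signs, diffs)))
        if stat >= observed - 1e-12:  # tolerance for floating-point ties
            hits += 1
    return sum(diffs) / len(diffs), hits / total

# Made-up motion-coherence scores for eight prompts, for illustration only.
without_specl = [0.61, 0.55, 0.70, 0.48, 0.66, 0.59, 0.52, 0.63]
with_specl = [0.66, 0.58, 0.71, 0.55, 0.69, 0.65, 0.54, 0.70]
mean_gain, p_value = paired_sign_flip_test(without_specl, with_specl)
print(round(mean_gain, 4), p_value)
```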

Figures

Figures reproduced from arXiv: 2606.11969 by Bohan Wang, Ruijie Quan, Xu Zhang, Yi Yang, Yu Lu, Zhaozheng Chen.

**Figure 2.** Trajectory drift and spectral rectification. (a) Visualizing intermediate lookahead predictions.

**Figure 3.** Pipeline of the proposed SpecLoR method. In Stage 1, the intermediate noisy latent…

**Figure 4.** User study on VideoJAM-Bench. We conduct a blind human preference study on VideoJAM-Bench. Annotators evaluate randomized video pairs (the Wan2.2 baseline vs. SpecLoR) across three criteria: Text Alignment, Motion Coherence, and Video Quality.

**Figure 6.** Diagnostic comparison of intervention targets (Phase…

**Figure 7.** Impact of modulation strength (λ). An aggressive intervention with λ = 1.0 (left) causes numerical instability in subsequent ODE steps, sometimes visually desynchronizing the initial frame from the rest of the sequence. Setting λ = 0.5 emerges as the optimal sweet spot, providing a firm energy anch…

Original abstract

Flow Matching has enabled robust text-to-video generation via latent ODE sampling. However, velocity approximation and numerical discretization errors inevitably accumulate, causing sampling trajectories to drift. Consequently, generated videos often suffer from severe spatiotemporal inconsistencies. Nevertheless, directly correcting these drifted, noisy latents is challenging: (i) timestep-dependent noise obscures reliable structural cues; (ii) spatial interventions risk disrupting intricate local geometry while incurring heavy computational costs. To address this, we propose Spectral Lookahead Rectification (SpecLoR), a plug-and-play inference method that bypasses noise via lookahead prediction, and circumvents spatiotemporal entanglement by shifting corrections to the frequency domain, where universal statistical priors of natural videos are readily available. First, during early sampling stages, SpecLoR looks ahead to estimate the clean latent $z_{t,0}$ and computes its 3D spatiotemporal spectrum. Next, SpecLoR rectifies the amplitude spectrum to match the prior, leaving the phase intact. Finally, the corrected state is re-noised to resume ODE integration. Experiments on Wan2.2 demonstrate that SpecLoR significantly reduces physical artifacts and enhances motion coherence across multiple benchmarks with minimal computational overhead (4 additional NFEs).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

SpecLoR adds a frequency-domain correction on early lookahead latents to reduce drift in flow-based video sampling, but the abstract supplies no numbers and the key assumption about lookahead quality remains untested.

The paper's core idea is straightforward: during early ODE steps in text-to-video flow matching, predict the clean latent, compute its 3D spatiotemporal spectrum, replace the amplitude with a natural-video prior while keeping phase, then re-noise and continue. This is presented as a plug-and-play fix that avoids direct spatial edits and adds only four extra NFEs.

What is new is the specific combination of lookahead prediction with amplitude-only rectification in the 3D frequency domain for this exact failure mode. Earlier spectral work in generation exists, but the targeted use here for spatiotemporal drift in latent ODE sampling is not covered by the prior work the paper cites.

The approach targets a genuine deployment problem. Accumulated velocity and discretization errors do produce visible inconsistencies, and moving the correction to frequency space is a reasonable way to sidestep local geometry issues.

The soft spots are substantial. The abstract contains zero quantitative results, no error bars, no ablations, and no direct test of whether the early lookahead latent actually carries usable structural signal rather than residual noise. If that lookahead estimate is still dominated by approximation error, matching its amplitude to a fixed prior reduces to imposing an average spectrum on noisy data, which is unlikely to improve coherence. The paper also treats the natural-video amplitude spectrum as a universal, timestep-independent prior without showing supporting measurements.

This is aimed at practitioners who run or fine-tune text-to-video models and want inference-time coherence fixes. Readers working on sampling corrections or frequency-domain priors could extract value if the experiments are solid. It deserves a serious referee because the problem is real and the method is specific enough to be checked.

Send it to review. Referees will need to see the full experimental section, lookahead fidelity plots, and any justification for the prior choice.

Referee Report

3 major / 1 minor

Summary. The manuscript proposes Spectral Lookahead Rectification (SpecLoR), a plug-and-play inference-time correction for flow-matching text-to-video models. During early sampling, it computes a lookahead estimate of the clean latent z_{t,0}, extracts its 3D spatiotemporal amplitude spectrum, rectifies the amplitude to match a fixed prior derived from natural videos while preserving phase, and re-noises the result to continue ODE integration. The central claim is that this reduces physical artifacts and improves motion coherence on the Wan2.2 model with only 4 additional NFEs.

Significance. If the empirical claims hold, SpecLoR would supply a lightweight, training-free mechanism for enforcing universal spectral statistics to counteract drift in latent video ODEs. The frequency-domain formulation avoids direct spatial edits and the explicit use of a timestep-independent natural-video prior is a clear conceptual contribution. The low overhead (4 NFEs) would make it attractive for practical deployment if the gains are reproducible across models and benchmarks.

major comments (3)

[Method description (lookahead rectification step)] The central claim requires that early-stage lookahead estimates z_{t,0} already contain usable structural signal rather than being dominated by velocity-approximation error. No measurement or ablation of lookahead fidelity (e.g., PSNR or spectrum correlation versus timestep or noise level) is supplied, leaving the load-bearing assumption untested.
[Abstract and Experiments section] The abstract states that SpecLoR 'significantly reduces physical artifacts and enhances motion coherence across multiple benchmarks' yet reports neither quantitative metrics, error bars, baseline comparisons, nor ablation results. Without these data the magnitude and reliability of the claimed improvement cannot be assessed.
[Spectral rectification procedure] The method assumes the 3D amplitude spectrum of natural videos constitutes a universal, timestep-independent prior that can be matched by amplitude-only rectification. No derivation or sensitivity analysis is given showing why phase preservation plus amplitude matching is sufficient to restore geometry rather than introducing new inconsistencies.
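The lookahead-fidelity measurement requested in the first major comment could be computed along the following lines. This is a hedged sketch: the function names are hypothetical, the peak used for PSNR is taken from the reference's dynamic range (latents have no fixed 0 to 255 scale), and the log-amplitude Pearson correlation is one of several reasonable choices for "spectrum correlation".

```python
import numpy as np

def psnr(estimate, reference):
    """PSNR in dB, using the reference's dynamic range as the peak value."""
    mse = np.mean((estimate - reference) ** 2)
    if mse == 0:
        return float("inf")
    peak = reference.max() - reference.min()
    return float(10.0 * np.log10(peak ** 2 / mse))

def log_amplitude_correlation(estimate, reference, eps=1e-8):
    """Pearson correlation between the log amplitude spectra of two (T, H, W) arrays."""
    a = np.log(np.abs(np.fft.fftn(estimate)) + eps).ravel()
    b = np.log(np.abs(np.fft.fftn(reference)) + eps).ravel()
    return float(np.corrcoef(a, b)[0, 1])

# Toy sweep: a synthetic lookahead whose error grows with the timestep t.
rng = np.random.default_rng(0)
clean = rng.standard_normal((4, 8, 8))
for t in (0.9, 0.7, 0.5):
    lookahead = clean + t * rng.standard_normal(clean.shape)  # stand-in for a real estimate
    print(t, round(psnr(lookahead, clean), 2), round(log_amplitude_correlation(lookahead, clean), 3))
```

In a real study the synthetic `lookahead` would be replaced by the model's estimate at each early timestep and `clean` by the final sampled latent or an encoded ground-truth clip.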

minor comments (1)

[Method] Notation for the lookahead estimate (z_{t,0}) and the re-noising step should be defined explicitly with reference to the underlying flow-matching ODE.
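For orientation, one common way to write these quantities is shown below. It assumes the linear (rectified-flow) interpolation path with t = 1 as pure noise; the paper may use a different time convention or re-noising rule, and the symbols \tilde{z}_{t,0} and \epsilon' are introduced here for illustration only.

```latex
% Assumed convention: z_0 is the clean latent, \epsilon \sim \mathcal{N}(0, I), t \in [0, 1].
z_t = (1 - t)\, z_0 + t\, \epsilon,
\qquad
\frac{\mathrm{d} z_t}{\mathrm{d} t} = v_\theta(z_t, t) \approx \epsilon - z_0

% Lookahead estimate of the clean latent from the state at time t:
z_{t,0} = z_t - t\, v_\theta(z_t, t)

% Re-noising the rectified estimate \tilde{z}_{t,0} with fresh noise \epsilon':
z_t' = (1 - t)\, \tilde{z}_{t,0} + t\, \epsilon'
```

Under this convention an exact velocity field gives z_{t,0} = z_0, so any gap between the two measures velocity-approximation error at time t.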

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and constructive report. We address each major comment below and will revise the manuscript to incorporate additional analysis and clarifications where appropriate.

Referee: [Method description (lookahead rectification step)] The central claim requires that early-stage lookahead estimates z_{t,0} already contain usable structural signal rather than being dominated by velocity-approximation error. No measurement or ablation of lookahead fidelity (e.g., PSNR or spectrum correlation versus timestep or noise level) is supplied, leaving the load-bearing assumption untested.

Authors: We agree that direct validation of lookahead fidelity strengthens the central assumption. In the revised manuscript we will add an ablation that reports PSNR and 3D spectral correlation between the lookahead estimate z_{t,0} and the corresponding clean latent, evaluated across a range of early timesteps and noise levels on the Wan2.2 model.

Revision: yes

Referee: [Abstract and Experiments section] The abstract states that SpecLoR 'significantly reduces physical artifacts and enhances motion coherence across multiple benchmarks' yet reports neither quantitative metrics, error bars, baseline comparisons, nor ablation results. Without these data the magnitude and reliability of the claimed improvement cannot be assessed.

Authors: The experiments section already contains quantitative metrics, baseline comparisons, and ablations; however, the abstract presents only a qualitative summary. We will revise the abstract to include specific numerical improvements (with error bars) drawn from the reported results and will ensure all claims are directly supported by the quantitative tables and figures.

Revision: yes

Referee: [Spectral rectification procedure] The method assumes the 3D amplitude spectrum of natural videos constitutes a universal, timestep-independent prior that can be matched by amplitude-only rectification. No derivation or sensitivity analysis is given showing why phase preservation plus amplitude matching is sufficient to restore geometry rather than introducing new inconsistencies.

Authors: We will add a short derivation in the method section that recalls the classical result that phase encodes structural geometry while amplitude governs energy distribution across frequencies; we will also include a sensitivity study that varies the strength of amplitude rectification and measures resulting geometric consistency (via optical-flow and edge-alignment metrics) to demonstrate that the chosen prior does not introduce new inconsistencies.

Revision: yes
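The classical phase-versus-amplitude result mentioned in the last response can be illustrated with a small 1D experiment. The sketch below is a toy demonstration under simple assumptions (an impulse and a narrow Gaussian bump on a periodic grid, with a hypothetical helper `swap_spectra`); it shows that a signal built from one source's amplitude and another's phase is located where the phase source was, and shaped like the amplitude source.

```python
import numpy as np

def swap_spectra(amp_source, phase_source):
    """Combine the amplitude spectrum of one signal with the phase of another."""
    amp = np.abs(np.fft.fft(amp_source))
    phase = np.angle(np.fft.fft(phase_source))
    return np.fft.ifft(amp * np.exp(1j * phase)).real

n, p, q, sigma = 64, 10, 40, 1.5
idx = np.arange(n)

impulse = np.zeros(n)
impulse[p] = 1.0                                  # sharp spike at position p

dist = np.minimum(np.abs(idx - q), n - np.abs(idx - q))
bump = np.exp(-dist ** 2 / (2 * sigma ** 2))      # smooth bump at position q

bump_shape_at_p = swap_spectra(amp_source=bump, phase_source=impulse)
spike_shape_at_q = swap_spectra(amp_source=impulse, phase_source=bump)

# Position follows the phase source; smoothness follows the amplitude source.
print(int(np.argmax(bump_shape_at_p)), int(np.argmax(spike_shape_at_q)))  # 10 40
```

This is the intuition behind editing amplitude only: the energy distribution across frequencies changes, while the layout encoded in the phase stays where it was.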

Circularity Check

0 steps flagged

No significant circularity; derivation uses external natural-video priors

The paper presents SpecLoR as an inference-time correction that computes a 3D spatiotemporal spectrum from an early lookahead estimate of the clean latent z_{t,0}, then matches its amplitude to a fixed prior derived from natural videos while preserving phase. No equations, fitted parameters, or self-citations are shown that would make the rectification reduce to a self-definitional fit, a renamed input, or a load-bearing self-citation chain. The prior is described as an external universal statistical property independent of the current sampling trajectory, and the method is framed as a plug-and-play addition rather than a tautological re-expression of its own inputs. The central claim therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities; the method implicitly assumes the existence of stable natural-video spectral statistics and accurate early lookahead.

Reference graph

Works this paper leans on

56 extracted references · 26 canonical work pages · 12 internal anchors

[1]

HunyuanVideo: A Systematic Framework For Large Video Generative Models

Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. HunyuanVideo: A systematic framework for large video generative models.arXiv preprint arXiv:2412.03603, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[2]

Wan: Open and Advanced Large-Scale Video Generative Models

Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models.arXiv preprint arXiv:2503.20314, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[3]

Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models

Yixin Liu, Kai Zhang, Yuan Li, Zhiling Yan, Chujie Gao, Ruoxi Chen, Zhengqing Yuan, Yue Huang, Hanchi Sun, Jianfeng Gao, et al. Sora: A review on background, technology, limitations, and opportunities of large vision models.arXiv preprint arXiv:2402.17177, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[4]

Kling-Omni Technical Report

Kling Team, Jialu Chen, Yuanzheng Ci, Xiangyu Du, Zipeng Feng, Kun Gai, Sainan Guo, Feng Han, Jingbin He, Kang He, et al. Kling-Omni technical report.arXiv preprint arXiv:2512.16776, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[5]

Phenaki: Variable length video generation from open domain textual descriptions

Ruben Villegas, Mohammad Babaeizadeh, Pieter-Jan Kindermans, Hernan Moraldo, Han Zhang, Moham- mad Taghi Saffar, Santiago Castro, Julius Kunze, and Dumitru Erhan. Phenaki: Variable length video generation from open domain textual descriptions. InInt. Conf. Learn. Represent., 2023

2023
[6]

Waver: Wave your way to lifelike video genera- tion.arXiv preprint arXiv:2508.15761, 2025

Yifu Zhang, Hao Yang, Yuqi Zhang, Yifei Hu, Fengda Zhu, Chuang Lin, Xiaofeng Mei, Yi Jiang, Bingyue Peng, and Zehuan Yuan. Waver: Wave your way to lifelike video generation.arXiv preprint arXiv:2508.15761, 2025

work page arXiv 2025
[7]

Seedance 1.0: Exploring the Boundaries of Video Generation Models

Yu Gao, Haoyuan Guo, Tuyen Hoang, Weilin Huang, Lu Jiang, Fangyuan Kong, Huixia Li, Jiashi Li, Liang Li, Xiaojie Li, et al. Seedance 1.0: Exploring the boundaries of video generation models.arXiv preprint arXiv:2506.09113, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[8]

Seedance 2.0: Advancing Video Generation for World Complexity

Team Seedance. Seedance 2.0: Advancing video generation for world complexity.arXiv preprint arXiv:2604.14148, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[9]

Flow matching for generative modeling

Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matthew Le. Flow matching for generative modeling. InInt. Conf. Learn. Represent., 2023

2023
[10]

Flow straight and fast: Learning to generate and transfer data with rectified flow

Xingchao Liu, Chengyue Gong, et al. Flow straight and fast: Learning to generate and transfer data with rectified flow. InInt. Conf. Learn. Represent., 2023

2023
[11]

FreeInit: Bridging initialization gap in video diffusion models

Tianxing Wu, Chenyang Si, Yuming Jiang, Ziqi Huang, and Ziwei Liu. FreeInit: Bridging initialization gap in video diffusion models. InEur. Conf. Comput. Vis., pages 378–394. Springer, 2024

2024
[12]

Restart sampling for improving generative processes

Yilun Xu, Mingyang Deng, Xiang Cheng, Yonglong Tian, Ziming Liu, and Tommi Jaakkola. Restart sampling for improving generative processes. InAdv. Neural Inform. Process. Syst., pages 76806–76838, 2023

2023
[13]

Self-Refining Video Sampling

Sangwon Jang, Taekyung Ki, Jaehyeong Jo, Saining Xie, Jaehong Yoon, and Sung Ju Hwang. Self-refining video sampling.arXiv preprint arXiv:2601.18577, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[14]

A general framework for inference-time scaling and steering of diffusion models

Raghav Singhal, Zachary Horvitz, Ryan Teehan, Mengye Ren, Zhou Yu, Kathleen McKeown, and Rajesh Ranganath. A general framework for inference-time scaling and steering of diffusion models. InInt. Conf. Mach. Learn., 2024

2024
[15]

Inference-time text-to-video alignment with diffusion latent beam search

Yuta Oshima, Masahiro Suzuki, Yutaka Matsuo, and Hiroki Furuta. Inference-time text-to-video alignment with diffusion latent beam search. InAdv. Neural Inform. Process. Syst., 2025

2025
[16]

Scaling image and video generation via test-time evolutionary search.arXiv preprint arXiv:2505.17618, 2025

Haoran He, Jiajun Liang, Xintao Wang, Pengfei Wan, Di Zhang, Kun Gai, and Ling Pan. Scaling image and video generation via test-time evolutionary search.arXiv preprint arXiv:2505.17618, 2025

work page arXiv 2025
[17]

FreqPrior: Improving video diffusion models with frequency filtering gaussian noise

Yunlong Yuan, Yuanfan Guo, Chunwei Wang, Wei Zhang, Hang Xu, and Li Zhang. FreqPrior: Improving video diffusion models with frequency filtering gaussian noise. InInt. Conf. Learn. Represent., 2025

2025
[18]

Statistics of natural time-varying images.Network: computation in neural systems, 6(3):345, 1995

Dawei W Dong and Joseph J Atick. Statistics of natural time-varying images.Network: computation in neural systems, 6(3):345, 1995

1995
[19]

The importance of phase in signals.Proceedings of the IEEE, 69(5):529–541, 1981

Alan V Oppenheim and Jae S Lim. The importance of phase in signals.Proceedings of the IEEE, 69(5):529–541, 1981

1981
[20]

Denoising diffusion probabilistic models

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. InAdv. Neural Inform. Process. Syst., volume 33, pages 6840–6851, 2020. 10

2020
[21]

High-resolution image synthesis with latent diffusion models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. InIEEE Conf. Comput. Vis. Pattern Recog., pages 10684– 10695, 2022

2022
[22]

Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram V oleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets.arXiv preprint arXiv:2311.15127, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[23]

Building normalizing flows with stochastic interpolants

Michael Albergo and Eric Vanden-Eijnden. Building normalizing flows with stochastic interpolants. InInt. Conf. Learn. Represent., 2023

2023
[24]

A theoretical analysis of discrete flow matching generative models.arXiv preprint arXiv:2509.22623, 2025

Maojiang Su, Mingcheng Lu, Jerry Yao-Chieh Hu, Shang Wu, Zhao Song, Alex Reneau, and Han Liu. A theoretical analysis of discrete flow matching generative models.arXiv preprint arXiv:2509.22623, 2025

work page arXiv 2025
[25]

Scalable diffusion models with transformers

William Peebles and Saining Xie. Scalable diffusion models with transformers. InInt. Conf. Comput. Vis., pages 4195–4205, 2023

2023
[26]

HunyuanVideo 1.5 Technical Report

Bing Wu, Chang Zou, Changlin Li, Duojun Huang, Fang Yang, Hao Tan, Jack Peng, Jianbing Wu, Jiangfeng Xiong, Jie Jiang, et al. Hunyuanvideo 1.5 technical report.arXiv preprint arXiv:2511.18870, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[27]

Unipc: A unified predictor-corrector framework for fast sampling of diffusion models

Wenliang Zhao, Lujia Bai, Yongming Rao, Jie Zhou, and Jiwen Lu. Unipc: A unified predictor-corrector framework for fast sampling of diffusion models. InAdv. Neural Inform. Process. Syst., volume 36, pages 49842–49869, 2023

2023
[28]

Temporal regularization makes your video generator stronger.arXiv preprint arXiv:2503.15417, 2025

Harold Haodong Chen, Haojian Huang, Xianfeng Wu, Yexin Liu, Yajing Bai, Wen-Jie Shu, Harry Yang, and Ser-Nam Lim. Temporal regularization makes your video generator stronger.arXiv preprint arXiv:2503.15417, 2025

work page arXiv 2025
[29]

InfLVG: Reinforce inference- time consistent long video generation with grpo.arXiv preprint arXiv:2505.17574, 2025

Xueji Fang, Liyuan Ma, Zhiyang Chen, Mingyuan Zhou, and Guo-jun Qi. InfLVG: Reinforce inference- time consistent long video generation with grpo.arXiv preprint arXiv:2505.17574, 2025

work page arXiv 2025
[30]

Video-T1: Test-time scaling for video generation

Fangfu Liu, Hanyang Wang, Yimo Cai, Kaiyan Zhang, Xiaohang Zhan, and Yueqi Duan. Video-T1: Test-time scaling for video generation. InInt. Conf. Comput. Vis., pages 18671–18681, 2025

2025
[31]

Improving Video Generation with Human Feedback

Jie Liu, Gongye Liu, Jiajun Liang, Ziyang Yuan, Xiaokun Liu, Mingwu Zheng, Xiele Wu, Qiulin Wang, Menghan Xia, Xintao Wang, et al. Improving video generation with human feedback.arXiv preprint arXiv:2501.13918, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[32]

Enhance-A-Video: Better generated video for free.arXiv preprint arXiv:2502.07508, 2025

Yang Luo, Xuanlei Zhao, Mengzhao Chen, Kaipeng Zhang, Wenqi Shao, Kai Wang, Zhangyang Wang, and Yang You. Enhance-A-Video: Better generated video for free.arXiv preprint arXiv:2502.07508, 2025

work page arXiv 2025
[33]

Optical-flow guided prompt optimization for coherent video generation

Hyelin Nam, Jaemin Kim, Dohun Lee, and Jong Chul Ye. Optical-flow guided prompt optimization for coherent video generation. InIEEE Conf. Comput. Vis. Pattern Recog., pages 7837–7846, 2025

2025
[34]

FreeNoise: Tuning-free longer video diffusion via noise rescheduling

Haonan Qiu, Menghan Xia, Yong Zhang, Yingqing He, Xintao Wang, Ying Shan, and Ziwei Liu. FreeNoise: Tuning-free longer video diffusion via noise rescheduling. InInt. Conf. Learn. Represent., 2024

2024
[35]

FreeLong: Training-free long video generation with spectralblend temporal attention

Yu Lu, Yuanzhi Liang, Linchao Zhu, and Yi Yang. FreeLong: Training-free long video generation with spectralblend temporal attention. InAdv. Neural Inform. Process. Syst., pages 131434–131455, 2024

2024
[36]

LongDiff: Training-free long video generation in one go

Zhuoling Li, Hossein Rahmani, Qiuhong Ke, and Jun Liu. LongDiff: Training-free long video generation in one go. InIEEE Conf. Comput. Vis. Pattern Recog., pages 17789–17798, 2025

2025
[37]

VideoGuide: Improving video diffusion models without training through a teacher’s guide

Dohun Lee, Bryan Sangwoo Kim, Geon Yeong Park, and Jong Chul Ye. VideoGuide: Improving video diffusion models without training through a teacher’s guide. InIEEE Conf. Comput. Vis. Pattern Recog., pages 2599–2608, 2025

2025
[38]

Pascal Chang, Jingwei Tang, Markus Gross, and Vinicius C. Azevedo. How i warped your noise: a temporally-correlated noise prior for diffusion models. InInt. Conf. Learn. Represent., 2024

2024
[39]

Factorized video generation: Decoupling scene construction and temporal synthesis in text-to-video diffusion models.arXiv preprint arXiv:2512.16371, 2025

Mariam Hassan, Bastien Van Delft, Wuyang Li, and Alexandre Alahi. Factorized video generation: Decoupling scene construction and temporal synthesis in text-to-video diffusion models.arXiv preprint arXiv:2512.16371, 2025

work page arXiv 2025
[40]

Structure from tracking: Distilling structure-preserving motion for video generation.arXiv preprint arXiv:2512.11792, 2025

Yang Fei, George Stoica, Jingyuan Liu, Qifeng Chen, Ranjay Krishna, Xiaojuan Wang, and Benlin Liu. Structure from tracking: Distilling structure-preserving motion for video generation.arXiv preprint arXiv:2512.11792, 2025. 11

work page arXiv 2025
[41]

arXiv preprint arXiv:2505.13344 (2025)

Ahmet Berke Gokmen, Yigit Ekin, Bahri Batuhan Bilecen, and Aysegul Dundar. RoPECraft: Training- free motion transfer with trajectory-guided rope optimization on diffusion transformers.arXiv preprint arXiv:2505.13344, 2025

work page arXiv 2025
[42]

MotionRAG: Motion retrieval- augmented image-to-video generation.arXiv preprint arXiv:2509.26391, 2025

Chenhui Zhu, Yilu Wu, Shuai Wang, Gangshan Wu, and Limin Wang. MotionRAG: Motion retrieval- augmented image-to-video generation.arXiv preprint arXiv:2509.26391, 2025

work page arXiv 2025
[43]

CFG-Zero*: Improved classifier-free guidance for flow matching models.arXiv preprint arXiv:2503.18886, 2025

Weichen Fan, Amber Yijia Zheng, Raymond A Yeh, and Ziwei Liu. CFG-Zero*: Improved classifier-free guidance for flow matching models.arXiv preprint arXiv:2503.18886, 2025

work page arXiv 2025
[44]

Classifier-free diffusion guidance

Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. InNeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications, 2021

2021
[45]

Diffusion rejection sampling

Byeonghu Na, Yeongmin Kim, Minsang Park, Donghyeok Shin, Wanmo Kang, and Il Chul Moon. Diffusion rejection sampling. InInt. Conf. Mach. Learn., volume 235, pages 37097–37121, 2024

2024
[46]

Test-time scaling of diffusion models via noise trajectory search

Vignav Ramesh and Morteza Mardani. Test-time scaling of diffusion models via noise trajectory search. InAdv. Neural Inform. Process. Syst., 2025

2025
[47]

arXiv preprint arXiv:2506.01144 (2025)

Ariel Shaulov, Itay Hazan, Lior Wolf, and Hila Chefer. FlowMo: Variance-based flow guidance for coherent motion in video generation.arXiv preprint arXiv:2506.01144, 2025

work page arXiv 2025
[48]

Improved video vae for latent video diffusion model

Pingyu Wu, Kai Zhu, Yu Liu, Liming Zhao, Wei Zhai, Yang Cao, and Zheng-Jun Zha. Improved video vae for latent video diffusion model. InIEEE Conf. Comput. Vis. Pattern Recog., pages 18124–18133, 2025

2025
[49]

Fourier priors-guided diffusion for zero-shot joint low-light enhancement and deblurring

Xiaoqian Lv, Shengping Zhang, Chenyang Wang, Yichen Zheng, Bineng Zhong, Chongyi Li, and Liqiang Nie. Fourier priors-guided diffusion for zero-shot joint low-light enhancement and deblurring. InIEEE Conf. Comput. Vis. Pattern Recog., pages 25378–25388, 2024

2024
[50]

FreeU: Free lunch in diffusion u-net

Chenyang Si, Ziqi Huang, Yuming Jiang, and Ziwei Liu. FreeU: Free lunch in diffusion u-net. InIEEE Conf. Comput. Vis. Pattern Recog., pages 4733–4743, 2024

2024
[51]

FAM Diffusion: Frequency and attention modulation for high-resolution image generation with stable diffusion

Haosen Yang, Adrian Bulat, Isma Hadji, Hai X Pham, Xiatian Zhu, Georgios Tzimiropoulos, and Brais Martinez. FAM Diffusion: Frequency and attention modulation for high-resolution image generation with stable diffusion. InIEEE Conf. Comput. Vis. Pattern Recog., pages 2459–2468, 2025

2025
[52]

VideoJAM: Joint appearance-motion representations for enhanced motion generation in video models.arXiv preprint arXiv:2502.02492, 2025

Hila Chefer, Uriel Singer, Amit Zohar, Yuval Kirstain, Adam Polyak, Yaniv Taigman, Lior Wolf, and Shelly Sheynin. VideoJAM: Joint appearance-motion representations for enhanced motion generation in video models.arXiv preprint arXiv:2502.02492, 2025

work page arXiv 2025
[53]

VideoScore2: Think before you score in generative video evaluation.arXiv preprint arXiv:2509.22799, 2025

Xuan He, Dongfu Jiang, Ping Nie, Minghao Liu, Zhengxuan Jiang, Mingyi Su, Wentao Ma, Junru Lin, Chun Ye, Yi Lu, et al. VideoScore2: Think before you score in generative video evaluation.arXiv preprint arXiv:2509.22799, 2025

work page arXiv 2025
[54]

VBench: Comprehensive benchmark suite for video generative models

Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, et al. VBench: Comprehensive benchmark suite for video generative models. InIEEE Conf. Comput. Vis. Pattern Recog., pages 21807–21818, 2024

2024
[55]

Movie Gen: A Cast of Media Foundation Models

Adam Polyak, Amit Zohar, Andrew Brown, Andros Tjandra, Animesh Sinha, Ann Lee, Apoorv Vyas, Bowen Shi, Chih-Yao Ma, Ching-Yao Chuang, et al. Movie Gen: A cast of media foundation models. arXiv preprint arXiv:2410.13720, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[56]

Yi Wang, Yinan He, Yizhuo Li, Kunchang Li, Jiashuo Yu, Xin Ma, Xinyuan Chen, Yaohui Wang, Ping Luo, Ziwei Liu, Yali Wang, Limin Wang, and Yu Qiao. InternVid: A large-scale video-text dataset for multimodal understanding and generation. arXiv preprint arXiv:2307.06942, 2023.

A Summary of the Appendix

We provide additional details and comprehensive analys...
