arxiv: 2606.11969 · v1 · pith:LC5BI77Qnew · submitted 2026-06-10 · 💻 cs.CV

SpecLoR: Spectral Lookahead Rectification for Motion-Coherent Text-to-Video Generation

Pith reviewed 2026-06-27 09:58 UTC · model grok-4.3

classification 💻 cs.CV
keywords text-to-video generation · flow matching · spectral rectification · motion coherence · lookahead prediction · frequency domain · latent ODE · artifact reduction

The pith

SpecLoR corrects drifted sampling trajectories in text-to-video generation by rectifying the amplitude spectrum of lookahead clean latents to match natural video priors.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Flow-matching text-to-video models accumulate velocity and discretization errors that push sampling trajectories away from natural video statistics, producing inconsistent motion and physical artifacts. SpecLoR intervenes during early sampling by predicting a clean latent, computing its three-dimensional spatiotemporal spectrum, and adjusting only the amplitude values to match the statistics of real videos while leaving phase information unchanged. The corrected latent is then re-noised and the ODE integration continues. This frequency-domain step requires four extra network evaluations and operates without retraining or direct spatial edits. A reader would care because the approach offers a lightweight, plug-in way to enforce motion coherence at inference time.

Core claim

SpecLoR performs lookahead to estimate the clean latent z_{t,0} early in sampling, computes its 3D spatiotemporal spectrum, rectifies the amplitude to match the prior of natural videos while leaving phase intact, and re-noises the corrected state to resume ODE integration, reducing physical artifacts and enhancing motion coherence.
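The steps in the core claim can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation. It assumes the common rectified-flow convention z_t = (1 - t) z_0 + t ε with velocity ε - z_0, a linear blend of amplitudes controlled by a strength `lam`, and fresh Gaussian noise for re-noising; none of these details is confirmed by the abstract. The names `velocity_fn`, `prior_amplitude`, and `spectral_lookahead_step` are hypothetical.

```python
import numpy as np

def lookahead_clean_latent(z_t, t, velocity_fn):
    # Assumed convention: z_t = (1 - t) * z_0 + t * noise and velocity = noise - z_0,
    # so a single jump to t = 0 gives z_0 ~= z_t - t * velocity.
    return z_t - t * velocity_fn(z_t, t)

def rectify_amplitude(z0_hat, prior_amplitude, lam=0.5, eps=1e-8):
    # 3D FFT over the last three axes (time, height, width).
    spectrum = np.fft.fftn(z0_hat, axes=(-3, -2, -1))
    amplitude = np.abs(spectrum)
    # Assumed rule: linear blend between the current amplitude and the prior.
    target = (1.0 - lam) * amplitude + lam * prior_amplitude
    # Scaling each coefficient by a non-negative real factor changes its
    # magnitude but leaves its phase untouched.
    rectified = spectrum * (target / (amplitude + eps))
    return np.fft.ifftn(rectified, axes=(-3, -2, -1)).real

def renoise(z0, t, noise):
    # Map a clean latent back to noise level t under the assumed linear path.
    return (1.0 - t) * z0 + t * noise

def spectral_lookahead_step(z_t, t, velocity_fn, prior_amplitude, rng, lam=0.5):
    z0_hat = lookahead_clean_latent(z_t, t, velocity_fn)
    z0_fixed = rectify_amplitude(z0_hat, prior_amplitude, lam=lam)
    # Assumption: fresh Gaussian noise is drawn for re-noising.
    return renoise(z0_fixed, t, rng.standard_normal(z_t.shape))
```

With `lam=0.0` the rectification leaves the lookahead estimate unchanged, and with `lam=1.0` the output amplitude equals the prior, which makes the two extremes easy to sanity-check before plugging in a real velocity network.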

What carries the argument

Spectral Lookahead Rectification, which shifts correction to the frequency domain by matching amplitude spectra of early clean latent estimates to natural video priors while preserving phase.

If this is right

  • Sampling trajectories stay closer to the manifold of natural videos.
  • Physical artifacts such as inconsistent object trajectories decrease across benchmarks.
  • Motion coherence improves while adding only four neural function evaluations.
  • Corrections avoid direct spatial edits that would risk local geometry and incur high cost.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same amplitude-matching step could be tested on image or 3D generative models that use flow or diffusion sampling.
  • Adaptive choice of when to apply the lookahead step might further reduce overhead on longer sequences.
  • Frequency-domain priors may prove more stable than spatial priors when videos contain complex camera motion.
  • The method leaves open whether amplitude-only rectification is always sufficient or whether limited phase adjustments would help in some cases.

Load-bearing premise

The three-dimensional spatiotemporal amplitude spectrum of natural videos supplies a universal, timestep-independent prior that amplitude rectification alone can safely enforce.
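The source does not say how the prior is obtained. One hypothetical way to make the premise testable is to average 3D amplitude spectra over a set of reference latents and then measure how far a given latent's spectrum sits from that average. The sketch below assumes clips stacked as an array of shape (N, T, H, W); the function names and the log-ratio gap are illustrative choices, not the paper's definitions.

```python
import numpy as np

def estimate_amplitude_prior(clips):
    # clips: array of shape (N, T, H, W); FFT over the (T, H, W) axes of each clip.
    spectra = np.fft.fftn(clips, axes=(1, 2, 3))
    # Average the magnitudes across the N clips.
    return np.abs(spectra).mean(axis=0)

def log_amplitude_gap(clip, prior, eps=1e-8):
    # Mean absolute log-ratio between a clip's amplitude spectrum and the prior.
    amplitude = np.abs(np.fft.fftn(clip))
    return float(np.mean(np.abs(np.log((amplitude + eps) / (prior + eps)))))
```

Evaluating `log_amplitude_gap` on lookahead estimates taken at different timesteps would give a rough indication of whether a single timestep-independent prior is plausible for a given model.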

What would settle it

Generate matched sets of videos with and without SpecLoR on identical prompts and test whether physical-artifact counts or motion-coherence scores differ by a statistically significant margin.
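Because the comparison is paired by prompt, a sign-flip permutation test on the per-prompt score differences is one reasonable way to check for a difference. The sketch below is an illustration under that assumption; the scores would come from whatever artifact or coherence metric is chosen, and `paired_permutation_test` is a hypothetical helper, not part of the paper.

```python
import numpy as np

def paired_permutation_test(baseline, treated, n_resamples=10_000, seed=0):
    baseline = np.asarray(baseline, dtype=float)
    treated = np.asarray(treated, dtype=float)
    # Per-prompt difference between the SpecLoR score and the baseline score.
    diffs = treated - baseline
    observed = diffs.mean()
    rng = np.random.default_rng(seed)
    # Under the null hypothesis the sign of each paired difference is arbitrary.
    signs = rng.choice([-1.0, 1.0], size=(n_resamples, diffs.size))
    null_means = (signs * diffs).mean(axis=1)
    # Two-sided p-value with the usual +1 correction.
    extreme = np.sum(np.abs(null_means) >= abs(observed))
    p_value = (extreme + 1) / (n_resamples + 1)
    return float(observed), float(p_value)
```

A small p-value together with a mean difference in the expected direction would support the claim; a p-value near 1 would indicate no detectable effect at the chosen sample size.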

Figures

Figures reproduced from arXiv: 2606.11969 by Bohan Wang, Ruijie Quan, Xu Zhang, Yi Yang, Yu Lu, Zhaozheng Chen.

Figure 1: Concept and pipeline of SpecLoR. (a) Accumulated errors drift the Flow Matching trajectory …
Figure 2: Trajectory drift and spectral rectification. (a) Visualizing intermediate lookahead predictions.
Figure 3: Pipeline of the proposed SpecLoR method. In Stage 1, the intermediate noisy latent …
Figure 4: User study on VideoJAM-Bench. User Study. We conduct a blind human preference study on VideoJAM-Bench. Annotators evaluate randomized video pairs (the Wan2.2 baseline vs. SpecLoR) across three criteria: Text Alignment, Motion Coherence, and Video Quality. As illustrated in …
Figure 6: Diagnostic comparison of intervention targets (Phase …
Figure 7: Impact of modulation strength (λ). An aggressive intervention with λ = 1.0 (left) causes numerical instability, sometimes visually desynchronizing the initial frame from the sequence. … numerical instability in subsequent ODE steps, causing the initial frame to become visually uncoordinated with the rest of the sequence (Fig. 7). Setting λ = 0.5 emerges as the optimal sweet spot, providing a firm energy anch…
Original abstract

Flow Matching has enabled robust text-to-video generation via latent ODE sampling. However, velocity approximation and numerical discretization errors inevitably accumulate, causing sampling trajectories to drift. Consequently, generated videos often suffer from severe spatiotemporal inconsistencies. Nevertheless, directly correcting these drifted, noisy latents is challenging: (i) timestep-dependent noise obscures reliable structural cues; (ii) spatial interventions risk disrupting intricate local geometry while incurring heavy computational costs. To address this, we propose Spectral Lookahead Rectification (SpecLoR), a plug-and-play inference method that bypasses noise via lookahead prediction, and circumvents spatiotemporal entanglement by shifting corrections to the frequency domain, where universal statistical priors of natural videos are readily available. First, during early sampling stages, SpecLoR looks ahead to estimate the clean latent $z_{t,0}$ and computes its 3D spatiotemporal spectrum. Next, SpecLoR rectifies the amplitude spectrum to match the prior, leaving the phase intact. Finally, the corrected state is re-noised to resume ODE integration. Experiments on Wan2.2 demonstrate that SpecLoR significantly reduces physical artifacts and enhances motion coherence across multiple benchmarks with minimal computational overhead (4 additional NFEs).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The manuscript proposes Spectral Lookahead Rectification (SpecLoR), a plug-and-play inference-time correction for flow-matching text-to-video models. During early sampling, it computes a lookahead estimate of the clean latent z_{t,0}, extracts its 3D spatiotemporal amplitude spectrum, rectifies the amplitude to match a fixed prior derived from natural videos while preserving phase, and re-noises the result to continue ODE integration. The central claim is that this reduces physical artifacts and improves motion coherence on the Wan2.2 model with only 4 additional NFEs.

Significance. If the empirical claims hold, SpecLoR would supply a lightweight, training-free mechanism for enforcing universal spectral statistics to counteract drift in latent video ODEs. The frequency-domain formulation avoids direct spatial edits and the explicit use of a timestep-independent natural-video prior is a clear conceptual contribution. The low overhead (4 NFEs) would make it attractive for practical deployment if the gains are reproducible across models and benchmarks.

major comments (3)
  1. [Method description (lookahead rectification step)] The central claim requires that early-stage lookahead estimates z_{t,0} already contain usable structural signal rather than being dominated by velocity-approximation error. No measurement or ablation of lookahead fidelity (e.g., PSNR or spectrum correlation versus timestep or noise level) is supplied, leaving the load-bearing assumption untested.
  2. [Abstract and Experiments section] The abstract states that SpecLoR 'significantly reduces physical artifacts and enhances motion coherence across multiple benchmarks' yet reports neither quantitative metrics, error bars, baseline comparisons, nor ablation results. Without these data the magnitude and reliability of the claimed improvement cannot be assessed.
  3. [Spectral rectification procedure] The method assumes the 3D amplitude spectrum of natural videos constitutes a universal, timestep-independent prior that can be matched by amplitude-only rectification. No derivation or sensitivity analysis is given showing why phase preservation plus amplitude matching is sufficient to restore geometry rather than introducing new inconsistencies.
minor comments (1)
  1. [Method] Notation for the lookahead estimate (z_{t,0}) and the re-noising step should be defined explicitly with reference to the underlying flow-matching ODE.
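The lookahead-fidelity measurement requested in the first major comment could be sketched as below. This is an illustration only: PSNR on latents has no canonical peak value, so the reference's dynamic range is used here as an assumption, and the spectral correlation is taken as the Pearson correlation of log-amplitude 3D spectra. Both function names are hypothetical.

```python
import numpy as np

def psnr(estimate, reference):
    mse = np.mean((estimate - reference) ** 2)
    if mse == 0:
        return float("inf")
    # Assumption: use the reference's dynamic range as the peak signal value.
    peak = reference.max() - reference.min()
    return float(10.0 * np.log10(peak ** 2 / mse))

def spectral_correlation(estimate, reference, eps=1e-8):
    # Pearson correlation between the log-amplitude 3D spectra of two latents.
    a = np.log(np.abs(np.fft.fftn(estimate)) + eps).ravel()
    b = np.log(np.abs(np.fft.fftn(reference)) + eps).ravel()
    return float(np.corrcoef(a, b)[0, 1])
```

Computing both quantities between the lookahead estimate and the final clean latent at several early timesteps would show how quickly usable structure appears along the trajectory.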

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and constructive report. We address each major comment below and will revise the manuscript to incorporate additional analysis and clarifications where appropriate.

Point-by-point responses
  1. Referee: [Method description (lookahead rectification step)] The central claim requires that early-stage lookahead estimates z_{t,0} already contain usable structural signal rather than being dominated by velocity-approximation error. No measurement or ablation of lookahead fidelity (e.g., PSNR or spectrum correlation versus timestep or noise level) is supplied, leaving the load-bearing assumption untested.

    Authors: We agree that direct validation of lookahead fidelity strengthens the central assumption. In the revised manuscript we will add an ablation that reports PSNR and 3D spectral correlation between the lookahead estimate z_{t,0} and the corresponding clean latent, evaluated across a range of early timesteps and noise levels on the Wan2.2 model. revision: yes

  2. Referee: [Abstract and Experiments section] The abstract states that SpecLoR 'significantly reduces physical artifacts and enhances motion coherence across multiple benchmarks' yet reports neither quantitative metrics, error bars, baseline comparisons, nor ablation results. Without these data the magnitude and reliability of the claimed improvement cannot be assessed.

    Authors: The experiments section already contains quantitative metrics, baseline comparisons, and ablations; however, the abstract presents only a qualitative summary. We will revise the abstract to include specific numerical improvements (with error bars) drawn from the reported results and will ensure all claims are directly supported by the quantitative tables and figures. revision: yes

  3. Referee: [Spectral rectification procedure] The method assumes the 3D amplitude spectrum of natural videos constitutes a universal, timestep-independent prior that can be matched by amplitude-only rectification. No derivation or sensitivity analysis is given showing why phase preservation plus amplitude matching is sufficient to restore geometry rather than introducing new inconsistencies.

    Authors: We will add a short derivation in the method section that recalls the classical result that phase encodes structural geometry while amplitude governs energy distribution across frequencies; we will also include a sensitivity study that varies the strength of amplitude rectification and measures resulting geometric consistency (via optical-flow and edge-alignment metrics) to demonstrate that the chosen prior does not introduce new inconsistencies. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation uses external natural-video priors

full rationale

The paper presents SpecLoR as an inference-time correction that computes a 3D spatiotemporal spectrum from an early lookahead estimate of the clean latent z_{t,0}, then matches its amplitude to a fixed prior derived from natural videos while preserving phase. No equations, fitted parameters, or self-citations are shown that would make the rectification reduce to a self-definitional fit, a renamed input, or a load-bearing self-citation chain. The prior is described as an external universal statistical property independent of the current sampling trajectory, and the method is framed as a plug-and-play addition rather than a tautological re-expression of its own inputs. The central claim therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities; the method implicitly assumes the existence of stable natural-video spectral statistics and accurate early lookahead.

pith-pipeline@v0.9.1-grok · 5756 in / 1163 out tokens · 16480 ms · 2026-06-27T09:58:50.084801+00:00 · methodology
