EasyVFX: Frequency-Driven Decoupling for Resource-Efficient VFX Generation

Fangneng Zhan; Hongyu Liu; Paul Liang; Qifeng Chen; Qinghe Wang; Shanhui Mo; Xinyu Wang; Xu Ye; Yinhan Zhang; Yuanpeng Che

arxiv: 2605.22051 · v1 · pith:CD73IVGInew · submitted 2026-05-21 · 💻 cs.CV

Yue Ma, Xu Ye, Qinghe Wang, Yucheng Wang, Hongyu Liu, Yinhan Zhang, Xinyu Wang, Yuanpeng Che, Shanhui Mo, Paul Liang, Fangneng Zhan, Qifeng Chen

Pith reviewed 2026-05-22 06:25 UTC · model grok-4.3

classification 💻 cs.CV

keywords Visual Effects Generation · Frequency Domain Decomposition · Mixture of Experts · Test-Time Training · Resource Efficient Generation · Computer Vision · Generative Models

The pith

EasyVFX decouples high-frequency spatial textures from low-frequency motion dynamics to generate realistic VFX with far less data and compute.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents EasyVFX as a way to make professional-grade visual effects generation practical under tight resource limits. It starts from the observation that VFX complexity comes from tightly coupled spatial details and temporal motions, and shows that separating them by frequency bands turns the problem into simpler sub-tasks. A two-stage process first trains a frequency-aware mixture of experts on broad priors using limited GPUs, then adapts the model to new effects in roughly 100 steps on a single GPU via a frequency-constraint loss. A sympathetic reader cares because current VFX pipelines demand massive datasets and hardware that most creators cannot access; if the separation holds, high-fidelity effects become reachable without those barriers.

Core claim

By decomposing VFX into high-frequency components that capture intricate spatial appearances and low-frequency components that capture global motion dynamics, the high-dimensional learning task reduces to manageable sub-problems. This spectral disentanglement is realized through a Frequency-aware Mixture-of-Experts architecture with soft routing across spectral bands, followed by test-time training that uses a Frequency-constraint Loss to adapt the pre-trained model to specific unseen effects with minimal steps and resources.
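
To make the split concrete, here is a minimal Python sketch that is not taken from the paper. It separates a 1D toy signal into a low band and a high band by masking a naive DFT. The function names (`dft`, `idft`, `split_bands`), the hard `cutoff`, and the 1D setting are all assumptions for illustration; the paper works on video latents, and its exact band definitions are not stated on this page.

```python
import cmath
import math


def dft(signal):
    """Naive O(n^2) discrete Fourier transform of a real or complex sequence."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum):
    """Inverse of dft(); returns complex values."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
        for t in range(n)
    ]


def split_bands(signal, cutoff):
    """Split a real signal into low- and high-frequency parts.

    Bins whose frequency index (counted symmetrically from DC) is <= cutoff
    go to the low band; every other bin goes to the high band.
    """
    n = len(signal)
    spectrum = dft(signal)
    low_spec = [c if min(k, n - k) <= cutoff else 0 for k, c in enumerate(spectrum)]
    high_spec = [c - l for c, l in zip(spectrum, low_spec)]
    low = [v.real for v in idft(low_spec)]
    high = [v.real for v in idft(high_spec)]
    return low, high


# Toy signal: one slow oscillation (1 cycle) plus one fast oscillation (6 cycles).
n = 16
signal = [
    math.sin(2 * math.pi * t / n) + 0.3 * math.sin(2 * math.pi * 6 * t / n)
    for t in range(n)
]
low, high = split_bands(signal, cutoff=2)
```

Because the mask is a partition of the spectrum, `low` and `high` sum back to the input, so nothing is lost by treating the two bands as separate learning targets in this toy setting.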

What carries the argument

Frequency-aware Mixture-of-Experts (Freq-MoE) that routes experts to distinct spectral bands via soft assignment, combined with a Frequency-constraint Loss for rapid test-time adaptation.
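
A rough sketch of what soft assignment across spectral bands could look like follows. It is a toy built on assumptions: the router here is a fixed softmax over log band energies, whereas the paper describes a lightweight learned router over LoRA experts (Figure 5). The names `route` and `mix_experts` and both stand-in experts are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route(band_energies, temperature=1.0):
    """Map per-band energies to soft expert weights (one expert per band).

    Log-energy is used as the routing logit, so each weight is proportional
    to energy ** (1 / temperature). This choice is illustrative only.
    """
    logits = [math.log(e + 1e-12) / temperature for e in band_energies]
    return softmax(logits)


def mix_experts(x, experts, weights):
    """Soft mixture: every expert runs and the outputs are blended by weight."""
    outputs = [expert(x) for expert in experts]
    return [sum(w * out[i] for w, out in zip(weights, outputs)) for i in range(len(x))]


# Two stand-in experts (hypothetical): a circular smoother and a detail booster.
def low_band_expert(x):
    return [(x[i - 1] + x[i] + x[(i + 1) % len(x)]) / 3 for i in range(len(x))]


def high_band_expert(x):
    return [2 * v for v in x]


weights = route(band_energies=[9.0, 1.0])  # the low band carries 9x the energy
mixed = mix_experts([1.0, 2.0, 3.0, 4.0], [low_band_expert, high_band_expert], weights)
```

Unlike hard top-1 routing, every expert contributes to every input here; the weights only shift emphasis between bands.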

If this is right

Specialized experts acquire foundational VFX knowledge using fewer GPU resources than standard end-to-end training.
New effects can be synthesized after only about 100 adaptation steps on a single GPU.
The resulting outputs maintain structural consistency while matching the visual fidelity of high-cost pipelines.
Overall data requirements drop because each sub-task focuses on a narrower frequency range.
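
The roughly-100-step adaptation can be pictured with a toy loop. The form of the Frequency-constraint Loss is not given on this page, so the sketch assumes one plausible shape: a reconstruction term plus a weighted penalty on the high-pass residual. It optimizes the prediction directly as a stand-in for adapter weights. `high_pass`, `lam`, `lr`, and the 3-tap filter are illustrative choices, not the authors' settings.

```python
def high_pass(x):
    """Circular high-pass: subtract a 3-tap moving average from each sample."""
    n = len(x)
    return [x[i] - (x[i - 1] + x[i] + x[(i + 1) % n]) / 3 for i in range(n)]


def loss_and_grad(pred, target, lam):
    """Loss = mean(r^2) + lam * mean(high_pass(r)^2), with r = pred - target.

    high_pass is linear and symmetric, so the gradient of the second term
    is 2 * lam * high_pass(high_pass(r)) / n.
    """
    n = len(pred)
    r = [p - t for p, t in zip(pred, target)]
    hr = high_pass(r)
    loss = sum(v * v for v in r) / n + lam * sum(v * v for v in hr) / n
    hhr = high_pass(hr)
    grad = [2 * (r[i] + lam * hhr[i]) / n for i in range(n)]
    return loss, grad


def adapt(pred, target, steps=100, lr=0.5, lam=1.0):
    """Plain gradient descent on the prediction (stand-in for adapter weights)."""
    pred = list(pred)
    history = []
    for _ in range(steps):
        loss, grad = loss_and_grad(pred, target, lam)
        history.append(loss)
        pred = [p - lr * g for p, g in zip(pred, grad)]
    return pred, history


# Assumed toy target with visible fine structure; start from an all-zero guess.
target = [0.0, 1.0, 0.0, -1.0, 0.0, 1.0, 0.0, -1.0]
adapted, history = adapt([0.0] * len(target), target)
```

The extra high-pass term raises the cost of errors in fine detail relative to a plain mean-squared error, which is one way a loss could push a short adaptation run to preserve textures.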

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same frequency split might simplify other generative video or animation tasks where fine detail and coarse motion already separate naturally.
If adaptation stays this light, on-device or small-studio VFX editing tools become feasible without cloud-scale training.
Extending the routing mechanism to additional frequency bands or modalities could further reduce compute for longer sequences.

Load-bearing premise

Separating high-frequency spatial appearances from low-frequency global motion dynamics substantially reduces VFX complexity and makes optimization easier.

What would settle it

A controlled experiment that trains identical models with and without the frequency decomposition on the same VFX dataset and shows whether the non-decomposed version requires significantly more data, more GPU hours, or more than roughly 100 adaptation steps to reach comparable structural consistency and visual quality.

Figures

Figures reproduced from arXiv: 2605.22051 by Fangneng Zhan, Hongyu Liu, Paul Liang, Qifeng Chen, Qinghe Wang, Shanhui Mo, Xinyu Wang, Xu Ye, Yinhan Zhang, Yuanpeng Che, Yucheng Wang, Yue Ma.

**Figure 1.** Showcase of visual effect generation produced by our proposed EasyVFX, which allows users to perform high-fidelity visual effect generation while maintaining coherent structures and temporal consistency following the reference video. The reference videos are shown in the lower left corner of the generated results, respectively. The input images are synthetic images produced by publicly available LoRA model…

**Figure 2.** Showcase of visual effect generation produced by our proposed EasyVFX, which allows users to perform high-fidelity visual effect generation while maintaining coherent structures and temporal consistency following the reference video. The reference videos are shown in the lower left corner of the generated results, respectively. The input images are synthetic images produced by publicly available LoRA model…

**Figure 3.** Analysis of spectral frequency components during visual effect generation. The left part plots the overall spectral energy evolution over diffusion time steps across normalized frequencies. The right panels visualize intermediate spatial features at select steps (t = 29, 45, 79), demonstrating that high-frequency components (bottom rows) consistently encode fine-grained visual effect details, while low-fr…

**Figure 4.** Comparison of training efficiency and generation quality. We compare EasyVFX with video generation and VFX baselines under the experimental settings described in Sec. ??. All CogVideoX-based methods are initialized from the same pretrained CogVideoX checkpoint, and the reported CogVideoX cost refers to fine-tuning on OpenVFX rather than full training from scratch. EasyVFX achieves a favorable trade-off be…

**Figure 5.** Overview of our method. Our method consists of two stages. Stage 1: Frequency-aware MoE Training. We employ a 3D VAE [19] to project input videos into a latent space. The core architecture uses a Frequency-aware Mixture-of-Experts (Freq-MoE) adapter, where a lightweight router assigns soft weights to LoRA experts according to coarse spectral energy cues extracted from noisy latents. This encourages the ada…

**Figure 6.** Qualitative comparison results with existing methods. EasyVFX produces visually plausible effects and coherent temporal evolution by leveraging frequency-aware expert routing to capture coarse effect appearance and motion cues.

**Figure 7.** Visual ablation of proposed Freq-MoE and Freq-constraint Loss. We compare the results without the Freq-MoE module (left) and without the Frequency-constraint Loss (right). The ablation study demonstrates that Freq-MoE is essential for better performance, while the Freq-constraint Loss plays a critical role in preserving high-frequency textures.

**Figure 1.** User Study Metric. Given an input prompt and the generated result, raters score (i) motion fidelity, (ii) appearance consistency, (iii) temporal consistency, and (iv) text similarity. We report the four scores as a vector to preserve per-aspect performance. A. Implementation details: All video inputs are resized to 480p resolution and unif…

**Figure 2.** User study results. We compare our method against state-of-the-art baselines across four metrics: appearance consistency, text similarity, temporal consistency, and motion fidelity. The results demonstrate that our approach is significantly preferred by human evaluators, consistently achieving the highest ratios of 'Excellent' and 'Good' ratings.

**Figure 3.** Failure cases. Our method struggles in highly dynamic scenes with large motion or complex interactions, leading to noticeable artifacts and temporal inconsistencies. • Motion Fidelity: Does the generated video reproduce motion patterns that are consistent with the reference effect video, without obvious drifting, motion collapse, or unnatural acceleration? • Appearance Consistency: Does the generated vis…

**Figure 4.** More qualitative comparison results with the state-of-the-art methods.

Original abstract

Generating high-fidelity visual effects (VFX) typically demands massive datasets and prohibitive computational power due to the intricate coupling of spatial textures and temporal dynamics. In this paper, we introduce EasyVFX, a resource-efficient framework that achieves realistic VFX synthesis under stringent constraints. Our core philosophy lies in frequency-domain decomposition: we observe that the complexity of VFX can be significantly mitigated by decoupling high-frequency components, which represent intricate spatial appearances, from low-frequency components that encapsulate global motion dynamics. This spectral disentanglement transforms a high-dimensional learning problem into manageable sub-tasks, thereby lowering the optimization barrier and reducing data dependency. Building upon this insight, we propose a two-stage training paradigm. First, we design a Frequency-aware Mixture-of-Experts (Freq-MoE) architecture. By utilizing a soft routing mechanism, our model assigns specialized experts to distinct spectral bands, enabling them to cultivate robust priors for appearance and motion dynamics. This specialization allows the model to acquire foundational VFX knowledge with fewer GPU resources. Second, we introduce a Test-Time Training strategy powered by a novel Frequency-constraint Loss. This allows the pre-trained model to swiftly adapt to specific, unseen effects through localized optimizations, requiring only about 100 steps on a single GPU. Experimental results demonstrate that EasyVFX produces structurally consistent and visually stunning effects, proving that frequency-aware learning is a key catalyst for democratizing professional-grade VFX.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

EasyVFX combines frequency decomposition with MoE routing and test-time adaptation to target resource-efficient VFX, but the high-frequency spatial vs low-frequency motion split looks unreliable for many common effects.

The main point to know is that EasyVFX frames VFX generation as a frequency decoupling problem and builds a two-stage system around it: a Frequency-aware MoE that routes experts to different spectral bands during pretraining, followed by a Frequency-constraint Loss for quick test-time adaptation on new effects using roughly 100 steps on one GPU. This is presented as a way to lower data and compute demands compared to standard coupled approaches.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces EasyVFX, a resource-efficient framework for high-fidelity VFX generation. It claims that frequency-domain decomposition decouples high-frequency components (intricate spatial appearances) from low-frequency components (global motion dynamics), transforming a coupled high-dimensional problem into manageable sub-tasks. The approach uses a two-stage paradigm: a Frequency-aware Mixture-of-Experts (Freq-MoE) architecture with soft routing for spectral-band specialization, followed by test-time training with a novel Frequency-constraint Loss that adapts the model to unseen effects in roughly 100 steps on a single GPU. The abstract asserts that this yields structurally consistent and visually stunning results while lowering data dependency and computational barriers.

Significance. If the central claims are substantiated with rigorous evidence, the work could meaningfully advance resource-efficient generative modeling in computer vision by demonstrating that frequency-aware specialization reduces the optimization barrier for complex VFX synthesis. This would support broader accessibility to professional-grade effects without massive datasets or GPU clusters, with the Freq-MoE and frequency-constraint mechanisms offering a concrete architectural path toward that goal.

major comments (2)

[Abstract] The manuscript asserts that 'experimental results demonstrate that EasyVFX produces structurally consistent and visually stunning effects' yet supplies no quantitative metrics, baselines, datasets, ablation studies, or implementation details. This absence directly prevents evaluation of whether the reported resource savings and lowered optimization barrier are supported by the data.
[Abstract] The core philosophy states that high-frequency components represent 'intricate spatial appearances' while low-frequency components 'encapsulate global motion dynamics,' underpinning both Freq-MoE routing and the Frequency-constraint Loss. This clean spectral split is load-bearing for the claimed complexity reduction; however, many VFX phenomena (fast particle motion, fluid turbulence, flickering lights) contain high-frequency content in the temporal domain, which risks misrouting and undermines the decoupling premise.
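
The second comment can be checked on a toy clip. The sketch below uses squared finite differences as a crude proxy for high-frequency energy (an assumption, not the paper's measure) and compares a flat flickering light with a static striped texture. The helper names are hypothetical.

```python
def spatial_hf_energy(clip):
    """Mean squared difference between neighbouring pixels within each frame."""
    diffs = [
        (frame[i + 1] - frame[i]) ** 2
        for frame in clip
        for i in range(len(frame) - 1)
    ]
    return sum(diffs) / len(diffs)


def temporal_hf_energy(clip):
    """Mean squared difference of each pixel between consecutive frames."""
    diffs = [
        (nxt[i] - cur[i]) ** 2
        for cur, nxt in zip(clip, clip[1:])
        for i in range(len(cur))
    ]
    return sum(diffs) / len(diffs)


# A flat light that toggles on and off every frame: 8 frames of 6 pixels.
flicker = [[float(t % 2)] * 6 for t in range(8)]
# A static fine stripe pattern: the same texture in every frame.
stripes = [[float(i % 2) for i in range(6)] for _ in range(8)]

flicker_spatial = spatial_hf_energy(flicker)    # 0.0
flicker_temporal = temporal_hf_energy(flicker)  # 1.0
stripes_spatial = spatial_hf_energy(stripes)    # 1.0
stripes_temporal = temporal_hf_energy(stripes)  # 0.0
```

The flicker clip has no spatial detail at all, yet its temporal variation is maximal. A router that reads only spatial spectra would therefore treat it as low-frequency content, which is the misrouting risk the referee describes.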

minor comments (1)

[Abstract] The phrase 'about 100 steps' is imprecise; reporting the exact step count, learning-rate schedule, or convergence criterion used in the test-time training would improve reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their careful reading of the manuscript and for highlighting these important points. We respond to each comment below, indicating where revisions will be made.

Referee: [Abstract] The manuscript asserts that 'experimental results demonstrate that EasyVFX produces structurally consistent and visually stunning effects' yet supplies no quantitative metrics, baselines, datasets, ablation studies, or implementation details. This absence directly prevents evaluation of whether the reported resource savings and lowered optimization barrier are supported by the data.

Authors: We acknowledge that the abstract does not contain specific quantitative metrics or other details. These are provided in the body of the manuscript, particularly in the Experiments and Implementation sections, where we present comparisons, ablations, and resource usage statistics. To improve clarity for readers, we will revise the abstract to include a short statement summarizing the key quantitative results and efficiency gains. revision: yes

Referee: [Abstract] The core philosophy states that high-frequency components represent 'intricate spatial appearances' while low-frequency components 'encapsulate global motion dynamics,' underpinning both Freq-MoE routing and the Frequency-constraint Loss. This clean spectral split is load-bearing for the claimed complexity reduction; however, many VFX phenomena (fast particle motion, fluid turbulence, flickering lights) contain high-frequency content in the temporal domain, which risks misrouting and undermines the decoupling premise.

Authors: We appreciate the referee raising this potential issue with the frequency decoupling assumption. In our framework, the frequency decomposition is applied spatially to disentangle appearance details from motion structures, while temporal aspects are handled through the video sequence modeling and the specialized experts. The soft routing in Freq-MoE provides adaptability, and the Frequency-constraint Loss during test-time training helps in capturing complex dynamics. We will expand the manuscript with a discussion on handling temporal high-frequency VFX elements and include supporting examples. revision: yes

Circularity Check

0 steps flagged

No circularity: core claim rests on stated observation, not self-referential derivation

full rationale

The paper presents frequency-domain decomposition as an initial observation that high-frequency components capture spatial appearances while low-frequency ones capture motion dynamics. This observation is used to motivate the Freq-MoE architecture and Frequency-constraint Loss, but no equations, derivations, or fitted parameters are shown to reduce back to themselves by construction. No self-citations are invoked as load-bearing uniqueness theorems, and no predictions are claimed from subsets of data that would force the result. The framework is therefore self-contained against external benchmarks rather than internally forced.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

Ledger populated from abstract claims only; full paper would be needed to identify additional fitted parameters or background assumptions.

axioms (1)

domain assumption: The complexity of VFX can be significantly mitigated by decoupling high-frequency components (intricate spatial appearances) from low-frequency components (global motion dynamics)
Presented as the core observation that transforms the learning problem.

invented entities (2)

Frequency-aware Mixture-of-Experts (Freq-MoE) · no independent evidence
purpose: Assigns specialized experts to distinct spectral bands via soft routing
New architecture component introduced for spectral specialization
Frequency-constraint Loss · no independent evidence
purpose: Powers test-time training for swift adaptation to unseen effects
Novel loss function for localized optimization in ~100 steps

pith-pipeline@v0.9.0 · 5821 in / 1162 out tokens · 37049 ms · 2026-05-22T06:25:50.163415+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/Breath1024.lean, IndisputableMonolith/Foundation/AlexanderDuality.lean · reality_from_one_distinction, alexander_duality_circle_linking · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

decoupling high-frequency components, which represent intricate spatial appearances, from low-frequency components that encapsulate global motion dynamics... Frequency-aware Mixture-of-Experts (Freq-MoE) architecture... Frequency-constraint Loss

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

77 extracted references · 77 canonical work pages · 12 internal anchors

[1] Civitai: Ai model sharing platform. https://civitai.com/models, 2026. Accessed: 2026-04-28.
[2] Niket Agarwal, Arslan Ali, Maciej Bala, Yogesh Balaji, Erik Barker, Tiffany Cai, Prithvijit Chattopadhyay, Yongxin Chen, Yin Cui, Yifan Ding, et al. Cosmos world foundation model platform for physical ai. arXiv preprint arXiv:2501.03575, 2025.
[3] Yuxuan Bian, Xin Chen, Zenan Li, Tiancheng Zhi, Shen Sang, Linjie Luo, and Qiang Xu. Video-as-prompt: Unified semantic control for video generation. arXiv preprint arXiv:2510.20888, 2025.
[4] Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Li Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, et al. Video generation models as world simulators. OpenAI Blog, 1(8):1, 2024.
[5] Zhengcong Fei, Debang Li, Di Qiu, Jiahua Wang, Yikun Dou, Rui Wang, Jingtao Xu, Mingyuan Fan, Guibin Chen, Yang Li, et al. Skyreels-a2: Compose anything in video diffusion transformers. arXiv preprint arXiv:2504.02436, 2025.
[6] Kunyu Feng, Yue Ma, Bingyuan Wang, Chenyang Qi, Haozhe Chen, Qifeng Chen, and Zeyu Wang. Dit4edit: Diffusion transformer for image editing. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 2969–2977, 2025.
[7] Daniel Geng, Charles Herrmann, Junhwa Hur, Forrester Cole, Serena Zhang, Tobias Pfaff, Tatiana Lopez-Guevara, Yusuf Aytar, Michael Rubinstein, Chen Sun, et al. Motion prompting: Controlling video generation with motion trajectories. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 1–12, 2025.
[8] Jiaqi Guo, Lianli Gao, Junchen Zhu, Jiaxin Zhang, Siyang Li, and Jingkuan Song. Magicvfx: Visual effects synthesis in just minutes. In Proceedings of the 32nd ACM International Conference on Multimedia, pages 8238–8246, 2024.
[9] Jianzhu Guo, Dingyun Zhang, Xiaoqiang Liu, Zhizhou Zhong, Yuan Zhang, Pengfei Wan, and Di Zhang. Liveportrait: Efficient portrait animation with stitching and retargeting control. arXiv preprint arXiv:2407.03168, 2024.
[10] Yuwei Guo, Ceyuan Yang, Anyi Rao, Maneesh Agrawala, Dahua Lin, and Bo Dai. Sparsectrl: Adding sparse controls to text-to-video diffusion models. In European Conference on Computer Vision, pages 330–348. Springer, 2024.
[11] Hao He, Yinghao Xu, Yuwei Guo, Gordon Wetzstein, Bo Dai, Hongsheng Li, and Ceyuan Yang. Cameractrl: Enabling camera control for text-to-video generation. arXiv preprint arXiv:2404.02101, 2024.
[12] Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance, 2022.
[13] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models. ICLR, 1(2):3, 2022.
[14] Li Hu. Animate anyone: Consistent and controllable image-to-video synthesis for character animation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8153–8163, 2024.
[15] Sujie Hu, Chubin Chen, Jiashu Zhu, Jiahong Wu, Xiangxiang Chu, and Xiu Li. Embedding-perturbed exploration preference optimization for flow models. arXiv preprint arXiv:2605.15803, 2026.
[16] Chi-Pin Huang, Yen-Siang Wu, Hung-Kai Chung, Kai-Po Chang, Fu-En Yang, and Yu-Chiang Frank Wang. Videomage: Multi-subject and motion customization of text-to-video diffusion models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 17603–17612, 2025.
[17] Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, et al. Vbench: Comprehensive benchmark suite for video generative models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21807–21818, 2024.
[18] Zeyinzi Jiang, Zhen Han, Chaojie Mao, Jingfeng Zhang, Yulin Pan, and Yu Liu. Vace: All-in-one video creation and editing. arXiv preprint arXiv:2503.07598, 2025.
[19] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
[20] Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. Hunyuanvideo: A systematic framework for large video generative models. arXiv preprint arXiv:2412.03603, 2024.
[21] Cheng Lei, Jiayu Zhang, Yue Ma, Xinyu Wang, Long Chen, Liang Tang, Yiqiang Yan, Fei Su, and Zhicheng Zhao. Ditraj: training-free trajectory control for video diffusion transformer. arXiv preprint arXiv:2509.21839, 2025.
[22] Baolu Li, Yiming Zhang, Qinghe Wang, Liqian Ma, Xiaoyu Shi, Xintao Wang, Pengfei Wan, Zhenfei Yin, Yunzhi Zhuge, Huchuan Lu, et al. Vfxmaster: Unlocking dynamic visual effect generation via in-context learning. arXiv preprint arXiv:2510.25772, 2025.
[23] Pengyang Ling, Jiazi Bu, Pan Zhang, Xiaoyi Dong, Yuhang Zang, Tong Wu, Huaian Chen, Jiaqi Wang, and Yi Jin. Motionclone: Training-free motion cloning for controllable video generation. arXiv preprint arXiv:2406.05338, 2024.
[24] Hongyu Liu, Xuan Wang, Ziyu Wan, Yue Ma, Jingye Chen, Yanbo Fan, Yujun Shen, Yibing Song, and Qifeng Chen. Avatarartist: Open-domain 4d avatarization. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 10758–10769, 2025.
[25] Xinyu Liu, Ailing Zeng, Wei Xue, Harry Yang, Wenhan Luo, Qifeng Liu, and Yike Guo. Vfx creator: Animated visual effect generation with controllable diffusion transformer. arXiv preprint arXiv:2502.05979, 2025.
[26] Zeqian Long, Mingzhe Zheng, Kunyu Feng, Xinhua Zhang, Hongyu Liu, Harry Yang, Linfeng Zhang, Qifeng Chen, and Yue Ma. Follow-your-shape: Shape-aware image editing via trajectory-guided region control. arXiv preprint arXiv:2508.08134, 2025.
[27] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
[28] Yue Ma, Yali Wang, Yue Wu, Ziyu Lyu, Siran Chen, Xiu Li, and Yu Qiao. Visual knowledge graph for human action reasoning in videos. In Proceedings of the 30th ACM International Conference on Multimedia, pages 4132–4141, 2022.
[29] Yue Ma, Yingqing He, Xiaodong Cun, Xintao Wang, Siran Chen, Xiu Li, and Qifeng Chen. Follow your pose: Pose-guided text-to-video generation using pose-free videos. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 4117–4125, 2024.
[30] Yue Ma, Hongyu Liu, Hongfa Wang, Heng Pan, Yingqing He, Junkun Yuan, Ailing Zeng, Chengfei Cai, Heung-Yeung Shum, Wei Liu, et al. Follow-your-emoji: Fine-controllable and expressive freestyle portrait animation. In SIGGRAPH Asia 2024 Conference Papers, pages 1–12, 2024.
[31] Yue Ma, Kunyu Feng, Zhongyuan Hu, Xinyu Wang, Yucheng Wang, Mingzhe Zheng, Xuanhua He, Chenyang Zhu, Hongyu Liu, Yingqing He, et al. Controllable video generation: A survey. arXiv preprint arXiv:2507.16869.
[32] Yue Ma, Kunyu Feng, Xinhua Zhang, Hongyu Liu, David Junhao Zhang, Jinbo Xing, Yinhan Zhang, Ayden Yang, Zeyu Wang, and Qifeng Chen. Follow-your-creation: Empowering 4d creation through video inpainting. arXiv preprint arXiv:2506.04590, 2025.
[33] Yue Ma, Yingqing He, Hongfa Wang, Andong Wang, Leqi Shen, Chenyang Qi, Jixuan Ying, Chengfei Cai, Zhifeng Li, Heung-Yeung Shum, et al. Follow-your-click: Open-domain regional image animation via motion prompts. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 6018–6026, 2025.
[34] Yue Ma, Yulong Liu, Qiyuan Zhu, Ayden Yang, Kunyu Feng, Xinhua Zhang, Zhifeng Li, Sirui Han, Chenyang Qi, and Qifeng Chen. Follow-your-motion: Video motion transfer via efficient spatial-temporal decoupled finetuning. arXiv preprint arXiv:2506.05207, 2025.
[35] Yue Ma, Zexuan Yan, Hongyu Liu, Hongfa Wang, Heng Pan, Yingqing He, Junkun Yuan, Ailing Zeng, Chengfei Cai, Heung-Yeung Shum, et al. Follow-your-emoji-faster: Towards efficient, fine-controllable, and expressive freestyle portrait animation. arXiv preprint arXiv:2509.16630, 2025.
[36] Yue Ma, Xinyu Wang, Qianli Ma, Qinghe Wang, Mingzhe Zheng, Xiangpeng Yang, Hao Li, Chongbo Zhao, Jixuan Ying, Harry Yang, et al. Group editing: Edit multiple images in one go. arXiv preprint arXiv:2603.22883, 2026.
[37] Yue Ma, Zhikai Wang, Tianhao Ren, Mingzhe Zheng, Hongyu Liu, Jiayi Guo, Mark Fong, Yuxuan Xue, Zixiang Zhao, Konrad Schindler, et al. Fastvmt: Eliminating redundancy in video motion transfer. arXiv preprint arXiv:2602.05551, 2026.
[38] Fangyuan Mao, Aiming Hao, Jintao Chen, Dongxia Liu, Xiaokun Feng, Jiashu Zhu, Meiqi Wu, Chubin Chen, Jiahong Wu, and Xiangxiang Chu. Omni-effects: Unified and spatially-controllable visual effects generation. arXiv preprint arXiv:2508.07981, 2025.

[39] Bohao Peng, Jian Wang, Yuechen Zhang, Wenbo Li, Ming-Chang Yang, and Jiaya Jia. Controlnext: Powerful and efficient control for image and video generation. arXiv preprint arXiv:2408.06070, 2024.

[40] Xiangfei Qiu, Jilin Hu, Lekui Zhou, Xingjian Wu, Junyang Du, Buang Zhang, Chenjuan Guo, Aoying Zhou, Christian S. Jensen, Zhenli Sheng, and Bin Yang. TFB: Towards comprehensive and fair benchmarking of time series forecasting methods. In Proc. VLDB Endow., pages 2363–2377, 2024.

[41] Xiangfei Qiu, Xingjian Wu, Hanyin Cheng, Xvyuan Liu, Chenjuan Guo, Jilin Hu, and Bin Yang. Dbloss: Decomposition-based loss function for time series forecasting. In NeurIPS, 2025.

[42] Xiangfei Qiu, Xingjian Wu, Yan Lin, Chenjuan Guo, Jilin Hu, and Bin Yang. DUET: Dual clustering enhanced multivariate time series forecasting. In SIGKDD, pages 1185–1196, 2025.

[43] Xiangfei Qiu, Yuhan Zhu, Zhengyu Li, Xingjian Wu, Bin Yang, and Jilin Hu. Dag: A dual correlation network for time series forecasting with exogenous variables. In ICML.

[44] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.

[45] Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538, 2017.

[46] Yutao Shen, Junkun Yuan, Toru Aonishi, Hideki Nakayama, and Yue Ma. Follow-your-preference: Towards preference-aligned image inpainting. arXiv preprint arXiv:2509.23082.

[47] Yiren Song, Danze Chen, and Mike Zheng Shou. Layertracer: Cognitive-aligned layered svg synthesis via diffusion transformer. arXiv preprint arXiv:2502.01105, 2025.

[48] Yiren Song, Cheng Liu, and Mike Zheng Shou. Makeanything: Harnessing diffusion transformers for multi-domain procedural sequence generation. arXiv preprint arXiv:2502.01572, 2025.

[49] Kling Team, Jialu Chen, Yuanzheng Ci, Xiangyu Du, Zipeng Feng, Kun Gai, Sainan Guo, Feng Han, Jingbin He, Kang He, et al. Kling-omni technical report. arXiv preprint arXiv:2512.16776, 2025.

[50] Boyang Wang, Xuweiyi Chen, Matheus Gadelha, and Zezhou Cheng. Frame in-n-out: Unbounded controllable image-to-video generation. arXiv preprint arXiv:2505.21491, 2025.

[51] Jiangshan Wang, Junfu Pu, Zhongang Qi, Jiayi Guo, Yue Ma, Nisha Huang, Yuxin Chen, Xiu Li, and Ying Shan. Taming rectified flow for inversion and editing. arXiv preprint arXiv:2411.04746, 2024.

[52] Tsun-Hsuan Wang, Yen-Chi Cheng, Chieh Hubert Lin, Hwann-Tzong Chen, and Min Sun. Point-to-point video generation. In Proceedings of the IEEE/CVF international conference on computer vision, pages 10491–10500, 2019.

[53] Xiang Wang, Shiwei Zhang, Changxin Gao, Jiayu Wang, Xiaoqiang Zhou, Yingya Zhang, Luxin Yan, and Nong Sang. Unianimate: Taming unified video diffusion models for consistent human image animation. Science China Information Sciences, 68(10):1–14, 2025.

[54] Zhao Wang et al. Customvideo: Customizing text-to-video generation with multiple subjects. arXiv:2401.09962, 2024.

[55] Yujie Wei et al. Dreamvideo: Composing your dream videos with customized subject and motion. In CVPR, 2024.

[56] Weijia Wu, Zhuang Li, Yuchao Gu, Rui Zhao, Yefei He, David Junhao Zhang, Mike Zheng Shou, Yan Li, Tingting Gao, and Di Zhang. Draganything: Motion control for anything using entity representation. In European Conference on Computer Vision, pages 331–348. Springer, 2024.

[57] Jinbo Xing, Menghan Xia, Yuxin Liu, Yuechen Zhang, Yong Zhang, Yingqing He, Hanyuan Liu, Haoxin Chen, Xiaodong Cun, Xintao Wang, et al. Make-your-video: Customized video generation using textual and structural guidance. IEEE Transactions on Visualization and Computer Graphics, 31(2):1526–1541, 2024.

[58] Xuancheng Xu, Yaning Li, Sisi You, and Bing-Kun Bao. Smrabooth: Subject and motion representation alignment for customized video generation. arXiv preprint arXiv:2512.12193, 2025.

[59] Xuancheng Xu, Ming Tao, and Bing-Kun Bao. Clgc: Continuous layout guidance for consistent text-to-video editing. In 2025 IEEE International Conference on Multimedia and Expo (ICME), pages 1–6. IEEE, 2025.

[60] Zhongcong Xu, Jianfeng Zhang, Jun Hao Liew, Hanshu Yan, Jia-Wei Liu, Chenxu Zhang, Jiashi Feng, and Mike Zheng Shou. Magicanimate: Temporally consistent human image animation using diffusion model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1481–1490, 2024.

[61] Yuxuan Xue, Xianghui Xie, Riccardo Marin, and Gerard Pons-Moll. Human-3diffusion: realistic avatar creation via explicit 3d consistent diffusion models. Advances in Neural Information Processing Systems, 37:99601–99645, 2024.

[62] Yuxuan Xue, Xianghui Xie, Margaret Kostyrko, and Gerard Pons-Moll. Infinihuman: Realistic 3d human creation with precise control. In Proceedings of the SIGGRAPH Asia 2025 Conference Papers, pages 1–12, 2025.

[63] Yuxuan Xue, Ruofan Liang, Egor Zakharov, Timur Bagautdinov, Chen Cao, Giljoo Nam, Shunsuke Saito, Gerard Pons-Moll, and Javier Romero. Georelight: Learning joint geometrical relighting and reconstruction with flexible multi-modal diffusion transformers. arXiv preprint arXiv:2604.20715, 2026.

[64] Zexuan Yan, Yue Ma, Chang Zou, Wenteng Chen, Qifeng Chen, and Linfeng Zhang. Eedit: Rethinking the spatial and temporal redundancy for efficient image editing. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 17474–17484, 2025.

[65] Xiangpeng Yang, Ji Xie, Yiyuan Yang, Yan Huang, Min Xu, and Qiang Wu. Unified video editing with temporal reasoner. arXiv preprint arXiv:2512.07469, 2025.

[66] Zhuoyi Yang, Jiayan Teng, et al. Cogvideox: Text-to-video diffusion models with an expert transformer. arXiv preprint arXiv:2408.06072, 2024.

[67] Guy Yariv, Yuval Kirstain, Amit Zohar, Shelly Sheynin, Yaniv Taigman, Yossi Adi, Sagie Benaim, and Adam Polyak. Through-the-mask: Mask-based motion trajectories for image-to-video generation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 18198–18208, 2025.

[68] Bo Yin, Xiaobin Hu, Xingyu Zhou, Peng-Tao Jiang, Yue Liao, Junwei Zhu, Jiangning Zhang, Ying Tai, Chengjie Wang, and Shuicheng Yan. Fera: Frequency-energy constrained routing for effective diffusion adaptation fine-tuning, 2025.

[69] Xinlei Yu, Zhangquan Chen, Yongbo He, Tianyu Fu, Cheng Yang, Chengming Xu, Yue Ma, Xiaobin Hu, Zhe Cao, Jie Xu, et al. The latent space: Foundation, evolution, mechanism, ability, and outlook. arXiv preprint arXiv:2604.02029.

[70] Shiyi Zhang, Junhao Zhuang, Zhaoyang Zhang, Ying Shan, and Yansong Tang. Flexiact: Towards flexible action control in heterogeneous scenarios. In Proceedings of the Special Interest Group on Computer Graphics and Interactive Techniques Conference Conference Papers, pages 1–11, 2025.

[71] Shiyi Zhang, Junhao Zhuang, Zhaoyang Zhang, Ying Shan, and Yansong Tang. Flexiact: Towards flexible action control in heterogeneous scenarios, 2025.

[72] Yuxuan Zhang, Yiren Song, Jiaming Liu, Rui Wang, Jinpeng Yu, Hao Tang, Huaxia Li, Xu Tang, Yao Hu, Han Pan, et al. Ssr-encoder: Encoding selective subject representation for subject-driven generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8069–8078, 2024.

[73] Yuxuan Zhang, Yirui Yuan, Yiren Song, Haofan Wang, and Jiaming Liu. Easycontrol: Adding efficient and flexible control for diffusion transformer. arXiv preprint arXiv:2503.07027, 2025.

[74] Zhongwei Zhang, Fuchen Long, Zhaofan Qiu, Yingwei Pan, Wu Liu, Ting Yao, and Tao Mei. Motionpro: A precise motion controller for image-to-video generation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 27957–27967, 2025.

[75] Haiyang Zhou, Wangbo Yu, Jiawen Guan, Xinhua Cheng, Yonghong Tian, and Li Yuan. Holotime: Taming video diffusion models for panoramic 4d scene generation. In Proceedings of the 33rd ACM International Conference on Multimedia, pages 9763–9772, 2025.

[76] Shenhao Zhu, Junming Leo Chen, Zuozhuo Dai, Zilong Dong, Yinghui Xu, Xun Cao, Yao Yao, Hao Zhu, and Siyu Zhu. Champ: Controllable and consistent human image animation with 3d parametric guidance. In European Conference on Computer Vision, pages 145–162. Springer, 2024.

[77] Junbao Zhuo, Xingyu Zhao, Shuhui Wang, Huimin Ma, and Qingming Huang. Synthesizing videos from images for image-to-video adaptation. In Proceedings of the 31st ACM International Conference on Multimedia, pages 8294–8303.

[33] [33]

Follow-your-click: Open-domain regional image animation via motion prompts

Yue Ma, Yingqing He, Hongfa Wang, Andong Wang, Leqi Shen, Chenyang Qi, Jixuan Ying, Chengfei Cai, Zhifeng Li, Heung-Yeung Shum, et al. Follow-your-click: Open-domain regional image animation via motion prompts. InProceed- ings of the AAAI Conference on Artificial Intelligence, pages 6018–6026, 2025. 3

work page 2025

[34] [34]

Follow-your-motion: Video motion transfer via efficient spatial-temporal decoupled finetuning.arXiv preprint arXiv:2506.05207, 2025

Yue Ma, Yulong Liu, Qiyuan Zhu, Ayden Yang, Kunyu Feng, Xinhua Zhang, Zhifeng Li, Sirui Han, Chenyang Qi, and Qifeng Chen. Follow-your-motion: Video motion transfer via efficient spatial-temporal decoupled finetuning.arXiv preprint arXiv:2506.05207, 2025. 3

work page arXiv 2025

[35] [35]

Follow-your-emoji-faster: To- wards efficient, fine-controllable, and expressive freestyle portrait animation.arXiv preprint arXiv:2509.16630, 2025

Yue Ma, Zexuan Yan, Hongyu Liu, Hongfa Wang, Heng Pan, Yingqing He, Junkun Yuan, Ailing Zeng, Chengfei Cai, Heung-Yeung Shum, et al. Follow-your-emoji-faster: To- wards efficient, fine-controllable, and expressive freestyle portrait animation.arXiv preprint arXiv:2509.16630, 2025

work page arXiv 2025

[36] [36]

Group editing: Edit multiple im- ages in one go.arXiv preprint arXiv:2603.22883, 2026

Yue Ma, Xinyu Wang, Qianli Ma, Qinghe Wang, Mingzhe Zheng, Xiangpeng Yang, Hao Li, Chongbo Zhao, Jixuan Ying, Harry Yang, et al. Group editing: Edit multiple im- ages in one go.arXiv preprint arXiv:2603.22883, 2026

work page arXiv 2026

[37] [37]

Fastvmt: Eliminat- ing redundancy in video motion transfer.arXiv preprint arXiv:2602.05551, 2026

Yue Ma, Zhikai Wang, Tianhao Ren, Mingzhe Zheng, Hongyu Liu, Jiayi Guo, Mark Fong, Yuxuan Xue, Zixi- ang Zhao, Konrad Schindler, et al. Fastvmt: Eliminat- ing redundancy in video motion transfer.arXiv preprint arXiv:2602.05551, 2026. 3

work page arXiv 2026

[38] [38]

Omni-effects: Unified and spatially-controllable visual effects generation.arXiv preprint arXiv:2508.07981, 2025

Fangyuan Mao, Aiming Hao, Jintao Chen, Dongxia Liu, Xiaokun Feng, Jiashu Zhu, Meiqi Wu, Chubin Chen, Ji- ahong Wu, and Xiangxiang Chu. Omni-effects: Unified and spatially-controllable visual effects generation.arXiv preprint arXiv:2508.07981, 2025. 4, 7, 8, S3

work page arXiv 2025

[39] [39]

Controlnext: Powerful and effi- cient control for image and video generation.arXiv preprint arXiv:2408.06070, 2024

Bohao Peng, Jian Wang, Yuechen Zhang, Wenbo Li, Ming- Chang Yang, and Jiaya Jia. Controlnext: Powerful and effi- cient control for image and video generation.arXiv preprint arXiv:2408.06070, 2024. 3

work page arXiv 2024

[40] [40]

Jensen, Zhenli Sheng, and Bin Yang

Xiangfei Qiu, Jilin Hu, Lekui Zhou, Xingjian Wu, Junyang Du, Buang Zhang, Chenjuan Guo, Aoying Zhou, Christian S. Jensen, Zhenli Sheng, and Bin Yang. TFB: Towards com- prehensive and fair benchmarking of time series forecasting methods. InProc. VLDB Endow., pages 2363–2377, 2024. 3

work page 2024

[41] [41]

Dbloss: Decomposition-based loss function for time series forecast- ing

Xiangfei Qiu, Xingjian Wu, Hanyin Cheng, Xvyuan Liu, Chenjuan Guo, Jilin Hu, and Bin Yang. Dbloss: Decomposition-based loss function for time series forecast- ing. InNeurIPS, 2025

work page 2025

[42] [42]

DUET: Dual clustering enhanced mul- tivariate time series forecasting

Xiangfei Qiu, Xingjian Wu, Yan Lin, Chenjuan Guo, Jilin Hu, and Bin Yang. DUET: Dual clustering enhanced mul- tivariate time series forecasting. InSIGKDD, pages 1185– 1196, 2025

work page 2025

[43] [43]

Dag: A dual correlation network for time series forecasting with exogenous variables

Xiangfei Qiu, Yuhan Zhu, Zhengyu Li, Xingjian Wu, Bin Yang, and Jilin Hu. Dag: A dual correlation network for time series forecasting with exogenous variables. InICML,

work page

[44] [44]

Learning transferable visual models from natural language supervi- sion

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervi- sion. InInternational conference on machine learning, pages 8748–8763. PmLR, 2021. 7

work page 2021

[45] [45]

Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outra- geously large neural networks: The sparsely-gated mixture- of-experts layer.arXiv preprint arXiv:1701.06538, 2017. 5

work page internal anchor Pith review Pith/arXiv arXiv 2017

[46] [46]

Follow-your-preference: Towards preference- aligned image inpainting.arXiv preprint arXiv:2509.23082,

Yutao Shen, Junkun Yuan, Toru Aonishi, Hideki Nakayama, and Yue Ma. Follow-your-preference: Towards preference- aligned image inpainting.arXiv preprint arXiv:2509.23082,

work page arXiv

[47] [47]

Layer- tracer: Cognitive-aligned layered svg synthesis via diffusion transformer.arXiv preprint arXiv:2502.01105, 2025

Yiren Song, Danze Chen, and Mike Zheng Shou. Layer- tracer: Cognitive-aligned layered svg synthesis via diffusion transformer.arXiv preprint arXiv:2502.01105, 2025. 3

work page arXiv 2025

[48] [48]

Makeany- thing: Harnessing diffusion transformers for multi- domain procedural sequence generation.arXiv preprint arXiv:2502.01572, 2025

Yiren Song, Cheng Liu, and Mike Zheng Shou. Makeany- thing: Harnessing diffusion transformers for multi- domain procedural sequence generation.arXiv preprint arXiv:2502.01572, 2025. 3

work page arXiv 2025

[49] [49]

Kling-Omni Technical Report

Kling Team, Jialu Chen, Yuanzheng Ci, Xiangyu Du, Zipeng Feng, Kun Gai, Sainan Guo, Feng Han, Jingbin He, Kang He, et al. Kling-omni technical report.arXiv preprint arXiv:2512.16776, 2025. 2

work page internal anchor Pith review Pith/arXiv arXiv 2025

[50] [50]

Frame in-n-out: Unbounded con- trollable image-to-video generation.arXiv preprint arXiv:2505.21491, 2025

Boyang Wang, Xuweiyi Chen, Matheus Gadelha, and Zezhou Cheng. Frame in-n-out: Unbounded con- trollable image-to-video generation.arXiv preprint arXiv:2505.21491, 2025. 3

work page arXiv 2025

[51] [51]

Tam- ing rectified flow for inversion and editing.arXiv preprint arXiv:2411.04746, 2024

Jiangshan Wang, Junfu Pu, Zhongang Qi, Jiayi Guo, Yue Ma, Nisha Huang, Yuxin Chen, Xiu Li, and Ying Shan. Tam- ing rectified flow for inversion and editing.arXiv preprint arXiv:2411.04746, 2024. 3

work page arXiv 2024

[52] [52]

Point-to-point video gen- eration

Tsun-Hsuan Wang, Yen-Chi Cheng, Chieh Hubert Lin, Hwann-Tzong Chen, and Min Sun. Point-to-point video gen- eration. InProceedings of the IEEE/CVF international con- ference on computer vision, pages 10491–10500, 2019. 3

work page 2019

[53] [53]

Unianimate: Taming unified video diffusion models for con- sistent human image animation.Science China Information Sciences, 68(10):1–14, 2025

Xiang Wang, Shiwei Zhang, Changxin Gao, Jiayu Wang, Xi- aoqiang Zhou, Yingya Zhang, Luxin Yan, and Nong Sang. Unianimate: Taming unified video diffusion models for con- sistent human image animation.Science China Information Sciences, 68(10):1–14, 2025. 3

work page 2025

[54] [54]

Customvideo: Customizing text-to-video generation with multiple subjects.arXiv:2401.09962, 2024

Zhao Wang et al. Customvideo: Customizing text-to-video generation with multiple subjects.arXiv:2401.09962, 2024. 3

work page arXiv 2024

[55] Yujie Wei et al. Dreamvideo: Composing your dream videos with customized subject and motion. In CVPR, 2024.

[56] Weijia Wu, Zhuang Li, Yuchao Gu, Rui Zhao, Yefei He, David Junhao Zhang, Mike Zheng Shou, Yan Li, Tingting Gao, and Di Zhang. Draganything: Motion control for anything using entity representation. In European Conference on Computer Vision, pages 331–348. Springer, 2024.

[57] Jinbo Xing, Menghan Xia, Yuxin Liu, Yuechen Zhang, Yong Zhang, Yingqing He, Hanyuan Liu, Haoxin Chen, Xiaodong Cun, Xintao Wang, et al. Make-your-video: Customized video generation using textual and structural guidance. IEEE Transactions on Visualization and Computer Graphics, 31(2):1526–1541, 2024.

[58] Xuancheng Xu, Yaning Li, Sisi You, and Bing-Kun Bao. Smrabooth: Subject and motion representation alignment for customized video generation. arXiv preprint arXiv:2512.12193, 2025.

[59] Xuancheng Xu, Ming Tao, and Bing-Kun Bao. Clgc: Continuous layout guidance for consistent text-to-video editing. In 2025 IEEE International Conference on Multimedia and Expo (ICME), pages 1–6. IEEE, 2025.

[60] Zhongcong Xu, Jianfeng Zhang, Jun Hao Liew, Hanshu Yan, Jia-Wei Liu, Chenxu Zhang, Jiashi Feng, and Mike Zheng Shou. Magicanimate: Temporally consistent human image animation using diffusion model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1481–1490, 2024.

[61] Yuxuan Xue, Xianghui Xie, Riccardo Marin, and Gerard Pons-Moll. Human-3diffusion: Realistic avatar creation via explicit 3d consistent diffusion models. Advances in Neural Information Processing Systems, 37:99601–99645, 2024.

[62] Yuxuan Xue, Xianghui Xie, Margaret Kostyrko, and Gerard Pons-Moll. Infinihuman: Realistic 3d human creation with precise control. In Proceedings of the SIGGRAPH Asia 2025 Conference Papers, pages 1–12, 2025.

[63] Yuxuan Xue, Ruofan Liang, Egor Zakharov, Timur Bagautdinov, Chen Cao, Giljoo Nam, Shunsuke Saito, Gerard Pons-Moll, and Javier Romero. Georelight: Learning joint geometrical relighting and reconstruction with flexible multi-modal diffusion transformers. arXiv preprint arXiv:2604.20715, 2026.

[64] Zexuan Yan, Yue Ma, Chang Zou, Wenteng Chen, Qifeng Chen, and Linfeng Zhang. Eedit: Rethinking the spatial and temporal redundancy for efficient image editing. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 17474–17484, 2025.

[65] Xiangpeng Yang, Ji Xie, Yiyuan Yang, Yan Huang, Min Xu, and Qiang Wu. VideoCoF: Unified video editing with temporal reasoner. arXiv preprint arXiv:2512.07469, 2025.

[66] Zhuoyi Yang, Jiayan Teng, et al. Cogvideox: Text-to-video diffusion models with an expert transformer. arXiv preprint arXiv:2408.06072, 2024.

[67] Guy Yariv, Yuval Kirstain, Amit Zohar, Shelly Sheynin, Yaniv Taigman, Yossi Adi, Sagie Benaim, and Adam Polyak. Through-the-mask: Mask-based motion trajectories for image-to-video generation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 18198–18208, 2025.

[68] Bo Yin, Xiaobin Hu, Xingyu Zhou, Peng-Tao Jiang, Yue Liao, Junwei Zhu, Jiangning Zhang, Ying Tai, Chengjie Wang, and Shuicheng Yan. Fera: Frequency-energy constrained routing for effective diffusion adaptation fine-tuning, 2025.

[69] Xinlei Yu, Zhangquan Chen, Yongbo He, Tianyu Fu, Cheng Yang, Chengming Xu, Yue Ma, Xiaobin Hu, Zhe Cao, Jie Xu, et al. The latent space: Foundation, evolution, mechanism, ability, and outlook. arXiv preprint arXiv:2604.02029,

[70] Shiyi Zhang, Junhao Zhuang, Zhaoyang Zhang, Ying Shan, and Yansong Tang. Flexiact: Towards flexible action control in heterogeneous scenarios. In Proceedings of the Special Interest Group on Computer Graphics and Interactive Techniques Conference Conference Papers, pages 1–11, 2025.

[71] Shiyi Zhang, Junhao Zhuang, Zhaoyang Zhang, Ying Shan, and Yansong Tang. Flexiact: Towards flexible action control in heterogeneous scenarios, 2025.

[72] Yuxuan Zhang, Yiren Song, Jiaming Liu, Rui Wang, Jinpeng Yu, Hao Tang, Huaxia Li, Xu Tang, Yao Hu, Han Pan, et al. Ssr-encoder: Encoding selective subject representation for subject-driven generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8069–8078, 2024.

[73] Yuxuan Zhang, Yirui Yuan, Yiren Song, Haofan Wang, and Jiaming Liu. Easycontrol: Adding efficient and flexible control for diffusion transformer. arXiv preprint arXiv:2503.07027, 2025.

[74] Zhongwei Zhang, Fuchen Long, Zhaofan Qiu, Yingwei Pan, Wu Liu, Ting Yao, and Tao Mei. Motionpro: A precise motion controller for image-to-video generation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 27957–27967, 2025.

[75] Haiyang Zhou, Wangbo Yu, Jiawen Guan, Xinhua Cheng, Yonghong Tian, and Li Yuan. Holotime: Taming video diffusion models for panoramic 4d scene generation. In Proceedings of the 33rd ACM International Conference on Multimedia, pages 9763–9772, 2025.

[76] Shenhao Zhu, Junming Leo Chen, Zuozhuo Dai, Zilong Dong, Yinghui Xu, Xun Cao, Yao Yao, Hao Zhu, and Siyu Zhu. Champ: Controllable and consistent human image animation with 3d parametric guidance. In European Conference on Computer Vision, pages 145–162. Springer, 2024.

[77] Junbao Zhuo, Xingyu Zhao, Shuhui Wang, Huimin Ma, and Qingming Huang. Synthesizing videos from images for image-to-video adaptation. In Proceedings of the 31st ACM International Conference on Multimedia, pages 8294–8303,