arxiv: 2503.07598 · v2 · submitted 2025-03-10 · 💻 cs.CV

VACE: All-in-One Video Creation and Editing

Zeyinzi Jiang , Zhen Han , Chaojie Mao , Jingfeng Zhang , Yulin Pan , Yu Liu

Pith reviewed 2026-05-16 00:48 UTC · model grok-4.3

classification 💻 cs.CV

keywords video generation · video editing · diffusion transformer · unified model · reference-to-video · masked editing · context adapter

The pith

VACE unifies reference-to-video generation, video-to-video editing, and masked editing in one diffusion transformer model.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

VACE introduces a single framework that combines video creation and editing tasks such as reference-to-video generation, video-to-video editing, and masked video-to-video editing. It organizes inputs including references, edits, and masks into a shared Video Condition Unit and uses a Context Adapter to embed task concepts along temporal and spatial dimensions inside a diffusion transformer. This design lets the model handle arbitrary combinations of these tasks without separate specialized networks. A reader would care because it promises to simplify video workflows by replacing multiple models with one that still matches their individual performance levels. The paper reports experimental results showing parity with task-specific systems across subtasks while enabling new combined applications.

Core claim

VACE enables unified video creation and editing by organizing task inputs such as editing, reference, and masking into a Video Condition Unit and injecting different task concepts through a Context Adapter that uses formalized temporal and spatial representations, allowing a single diffusion transformer to handle arbitrary video synthesis tasks with performance on par with task-specific models and support for versatile task combinations.

What carries the argument

The Video Condition Unit that unifies editing, reference, and masking inputs together with the Context Adapter structure that injects task-specific concepts into the model along temporal and spatial dimensions.
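
One way to picture the unified interface is as a single record that every task fills in differently. The sketch below is illustrative only: it assumes the unit can be modeled as a text prompt plus per-frame pixels and per-frame binary masks, and the class name, field names, and helper functions are hypothetical rather than taken from the authors' code.

```python
from dataclasses import dataclass
from typing import List, Tuple

Frame = List[float]  # hypothetical: one frame as a flat list of pixel values
Mask = List[int]     # hypothetical: 1 = regenerate this pixel, 0 = keep it


@dataclass
class VideoConditionUnit:
    """Illustrative container; the field names are assumptions."""
    text: str
    frames: List[Frame]
    masks: List[Mask]


def text_to_video(prompt: str, num_frames: int, pixels: int) -> VideoConditionUnit:
    # No visual context: blank frames, and every pixel is marked for generation.
    return VideoConditionUnit(
        prompt,
        [[0.0] * pixels for _ in range(num_frames)],
        [[1] * pixels for _ in range(num_frames)],
    )


def masked_edit(prompt: str, video: List[Frame], masks: List[Mask]) -> VideoConditionUnit:
    # Masked editing: the caller supplies both the source frames and the masks.
    if len(video) != len(masks):
        raise ValueError("need exactly one mask per frame")
    return VideoConditionUnit(prompt, [list(f) for f in video], [list(m) for m in masks])


def with_references(vcu: VideoConditionUnit, refs: List[Frame]) -> VideoConditionUnit:
    # Reference images are prepended as frames to keep untouched (all-zero masks).
    ref_masks = [[0] * len(r) for r in refs]
    return VideoConditionUnit(
        vcu.text, [list(r) for r in refs] + vcu.frames, ref_masks + vcu.masks
    )


def split_by_mask(vcu: VideoConditionUnit) -> Tuple[List[Frame], List[Frame]]:
    # Separate each frame into the pixels to regenerate and the pixels to keep.
    to_change = [[p * m for p, m in zip(f, mk)] for f, mk in zip(vcu.frames, vcu.masks)]
    to_keep = [[p * (1 - m) for p, m in zip(f, mk)] for f, mk in zip(vcu.frames, vcu.masks)]
    return to_change, to_keep
```

Because every task reduces to the same three fields in this sketch, a combined task such as masked editing with a reference image is simply `with_references(masked_edit(...), refs)`, which is the kind of composition the page attributes to the unified design.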

If this is right

Users can execute reference-to-video generation, video-to-video editing, and masked editing through one model instead of switching systems.
Task combinations become possible that were previously blocked by the need for separate models.
Performance across subtasks stays comparable to dedicated models rather than falling behind.
The pipeline for video content creation simplifies because a single trained network covers multiple use cases.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Deployment could become cheaper by maintaining only one large model instead of several task-specific ones.
The same unification pattern might extend to additional tasks such as text-guided video or audio-conditioned editing.
Real-time or interactive video applications could benefit from reduced switching overhead between models.
Generalization across rare or novel task mixes might improve because the shared backbone learns richer video representations.

Load-bearing premise

The Video Condition Unit and Context Adapter can integrate the requirements of reference-to-video, video-to-video, and masked editing tasks into one model without causing performance to drop below that of specialized systems.

What would settle it

A side-by-side benchmark on a standard video dataset where VACE scores more than 5 percent worse than a dedicated reference-to-video model on metrics such as FVD or human preference ratings.
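
The 5 percent criterion can be stated as a small check. This is a minimal sketch, assuming the gap is measured relative to the specialist model's score; the function names are hypothetical, and the `lower_is_better` flag covers metrics such as FVD (lower is better) as well as preference rates (higher is better).

```python
def relative_gap(unified: float, specialist: float, lower_is_better: bool = True) -> float:
    """Fraction by which the unified model is worse than the specialist.

    A negative value means the unified model is better.
    """
    if specialist == 0:
        raise ValueError("specialist score must be non-zero")
    diff = unified - specialist if lower_is_better else specialist - unified
    return diff / abs(specialist)


def falsifies_parity(unified: float, specialist: float,
                     lower_is_better: bool = True, threshold: float = 0.05) -> bool:
    # The parity claim fails only when the gap strictly exceeds the threshold.
    return relative_gap(unified, specialist, lower_is_better) > threshold
```

With made-up scores, an FVD of 110 against a specialist's 100 is a 10 percent gap and would count against the claim, while 103 against 100 would not.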

Original abstract

Diffusion Transformer has demonstrated powerful capability and scalability in generating high-quality images and videos. Further pursuing the unification of generation and editing tasks has yielded significant progress in the domain of image content creation. However, due to the intrinsic demands for consistency across both temporal and spatial dynamics, achieving a unified approach for video synthesis remains challenging. We introduce VACE, which enables users to perform Video tasks within an All-in-one framework for Creation and Editing. These tasks include reference-to-video generation, video-to-video editing, and masked video-to-video editing. Specifically, we effectively integrate the requirements of various tasks by organizing video task inputs, such as editing, reference, and masking, into a unified interface referred to as the Video Condition Unit (VCU). Furthermore, by utilizing a Context Adapter structure, we inject different task concepts into the model using formalized representations of temporal and spatial dimensions, allowing it to handle arbitrary video synthesis tasks flexibly. Extensive experiments demonstrate that the unified model of VACE achieves performance on par with task-specific models across various subtasks. Simultaneously, it enables diverse applications through versatile task combinations. Project page: https://ali-vilab.github.io/VACE-Page/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces VACE, a unified Diffusion Transformer framework for video creation and editing. It organizes inputs for reference-to-video generation, video-to-video editing, and masked video-to-video editing via a Video Condition Unit (VCU) and injects task concepts through a Context Adapter using formalized temporal and spatial representations. The central claim is that this single model achieves performance on par with task-specific models across subtasks while supporting versatile task combinations.

Significance. If the parity result holds with rigorous evidence, the work would advance unified video synthesis by demonstrating that a single architecture can handle multiple video tasks without degradation, enabling flexible combinations and reducing reliance on specialized models. This addresses the challenge of temporal-spatial consistency in video generation and could influence scalable all-in-one systems in the field.

major comments (2)

[Experiments] Experiments section: the claim that VACE achieves on-par performance with task-specific models lacks reported quantitative metrics (e.g., FVD, FID, CLIP scores), baseline comparisons, ablation studies, or error analysis. Without these, the unification claim cannot be verified and the weakest assumption—that VCU and Context Adapter avoid negative transfer—remains untested.
[Method] Method section describing VCU and Context Adapter: the integration of reference, editing, and masking signals into a unified interface is load-bearing for the no-degradation result, yet no controls or ablations isolate cross-task interference (e.g., potential collisions between mask tokens and reference features in high-motion regions).

minor comments (2)

[Abstract] Abstract: the phrase 'extensive experiments demonstrate' should be supported by explicit mention of datasets, evaluation protocols, and at least one key quantitative result to strengthen the summary.
[Method] Notation: the formalization of temporal and spatial dimensions in the Context Adapter would benefit from an explicit equation or diagram showing how task concepts are injected without overlap.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our work. The comments highlight important areas for improvement in presenting the experimental validation and methodological robustness. We will revise the manuscript to address these points as detailed below.

Point-by-point responses

Referee: [Experiments] Experiments section: the claim that VACE achieves on-par performance with task-specific models lacks reported quantitative metrics (e.g., FVD, FID, CLIP scores), baseline comparisons, ablation studies, or error analysis. Without these, the unification claim cannot be verified and the weakest assumption—that VCU and Context Adapter avoid negative transfer—remains untested.

Authors: We acknowledge the need for more rigorous quantitative support. The original manuscript includes some qualitative results and limited metrics. To fully verify the on-par performance claim, the revision will add comprehensive tables with FVD, FID, and CLIP scores across all tasks, comparisons to multiple baselines, ablation studies on the VCU and Context Adapter components, and error analysis for cases where performance differs. This will directly test the no-negative-transfer assumption.

revision: yes

Referee: [Method] Method section describing VCU and Context Adapter: the integration of reference, editing, and masking signals into a unified interface is load-bearing for the no-degradation result, yet no controls or ablations isolate cross-task interference (e.g., potential collisions between mask tokens and reference features in high-motion regions).

Authors: We agree that isolating potential interference is crucial. In the revised manuscript, we will include additional ablation studies that separately evaluate the impact of each input type in the VCU and their combinations. Specifically, we will report results on high-motion video segments to check for any collisions or degradations between mask and reference signals. These controls will strengthen the evidence for the unified interface's effectiveness.

revision: yes
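
The ablation promised in the second response amounts to evaluating every non-empty combination of input types. A minimal sketch of that grid follows, assuming the three input types named in the report; the identifiers are hypothetical and no evaluation results are implied.

```python
from itertools import combinations
from typing import List, Sequence, Tuple

# Input types named in the referee report; the string labels are assumptions.
VCU_INPUTS = ("reference", "edit", "mask")


def ablation_grid(inputs: Sequence[str] = VCU_INPUTS) -> List[Tuple[str, ...]]:
    """Return every non-empty subset of input types, smallest subsets first."""
    return [
        subset
        for size in range(1, len(inputs) + 1)
        for subset in combinations(inputs, size)
    ]
```

For three input types this yields seven configurations: three single-input runs, three pairs, and the full combination. Comparing each pair against its two single-input runs is one way to expose interference such as the mask and reference collision the referee raises.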

Circularity Check

0 steps flagged

No circularity detected in derivation chain

Full rationale

The paper presents VACE as a novel architecture that organizes inputs via the Video Condition Unit and injects concepts through the Context Adapter, with the central claim of parity to task-specific models resting on empirical validation from extensive experiments rather than any self-referential derivation, fitted parameters renamed as predictions, or load-bearing self-citations. No equations or steps in the provided abstract reduce by construction to the inputs; the unification is introduced as a design choice whose effectiveness is asserted to be shown externally, making the derivation self-contained against benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 2 invented entities

The central claim depends on the effectiveness of two newly introduced components (VCU and Context Adapter) within a standard Diffusion Transformer backbone; no explicit free parameters, axioms, or invented entities with independent evidence are detailed in the abstract.

invented entities (2)

Video Condition Unit (VCU) · no independent evidence
purpose: Unified interface that organizes inputs such as editing instructions, references, and masks for different video tasks.
Newly proposed structure to enable task integration; no independent evidence outside the paper.

Context Adapter · no independent evidence
purpose: Mechanism to inject task-specific concepts into the model using formalized temporal and spatial representations.
Newly proposed component to support flexible handling of arbitrary tasks; no independent evidence outside the paper.

Forward citations

Cited by 18 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

ActivityForensics: A Comprehensive Benchmark for Localizing Manipulated Activity in Videos
cs.CV 2026-04 unverdicted novelty 8.0

ActivityForensics is the first large-scale benchmark for temporally localizing activity-level forgeries in videos, paired with a diffusion-based baseline called TADiff.
MiVE: Multiscale Vision-language features for reference-guided video Editing
cs.CV 2026-05 unverdicted novelty 7.0

MiVE repurposes VLMs as multiscale feature extractors integrated into a unified self-attention Diffusion Transformer, achieving top human preference in reference-guided video editing.
DreamStereo: Towards Real-Time Stereo Inpainting for HD Videos
cs.CV 2026-04 unverdicted novelty 7.0

DreamStereo uses GAPW, PBDP, and SASI to enable real-time stereo video inpainting at 25 FPS for HD videos by reducing over 70% redundant computation while maintaining quality.
Controllable Generative Video Compression
cs.CV 2026-04 unverdicted novelty 7.0

CGVC uses coded keyframes and per-frame priors to guide controllable generative reconstruction of video frames, outperforming prior perceptual compression methods in both signal fidelity and perceptual quality.
Physics-Aware Video Instance Removal Benchmark
cs.CV 2026-04 unverdicted novelty 7.0

The PVIR benchmark tests video object removal on physical consistency using 95 annotated videos and shows that existing methods struggle with complex interactions like lingering shadows.
FaithfulFaces: Pose-Faithful Facial Identity Preservation for Text-to-Video Generation
cs.CV 2026-05 unverdicted novelty 6.0

FaithfulFaces introduces a pose-faithful identity aligner with a shared dictionary and invariance constraint to maintain facial identity in text-to-video generation under large pose changes and occlusions.
Bridging the Embodiment Gap: Disentangled Cross-Embodiment Video Editing
cs.RO 2026-05 unverdicted novelty 6.0

A dual-contrastive disentanglement method factorizes videos into independent task and embodiment latents, then uses a parameter-efficient adapter on a frozen video diffusion model to synthesize robot executions from s...
VEFX-Bench: A Holistic Benchmark for Generic Video Editing and Visual Effects
cs.CV 2026-04 unverdicted novelty 6.0

VEFX-Bench releases a large human-labeled video editing dataset, a multi-dimensional reward model, and a standardized benchmark that better matches human judgments than generic evaluators.
VibeFlow: Versatile Video Chroma-Lux Editing through Self-Supervised Learning
cs.CV 2026-04 unverdicted novelty 6.0

VibeFlow performs versatile video chroma-lux editing in zero-shot fashion by self-supervised disentanglement of structure and color-illumination cues inside pre-trained video models, plus residual velocity fields and ...
OmniShow: Unifying Multimodal Conditions for Human-Object Interaction Video Generation
cs.CV 2026-04 unverdicted novelty 6.0

OmniShow unifies text, image, audio, and pose conditions into an end-to-end model for high-quality human-object interaction video generation and introduces the HOIVG-Bench benchmark, claiming state-of-the-art results.
InsEdit: Towards Instruction-based Visual Editing via Data-Efficient Video Diffusion Models Adaptation
cs.CV 2026-04 unverdicted novelty 6.0

InsEdit adapts a video diffusion backbone for text-instruction video editing via Mutual Context Attention, achieving SOTA open-source results with O(100K) data while also supporting image editing.
Lighting-grounded Video Generation with Renderer-based Agent Reasoning
cs.CV 2026-04 unverdicted novelty 6.0

LiVER conditions video diffusion models on renderer-derived 3D control signals for disentangled, editable control over object layout, lighting, and camera trajectory.
DiffHDR: Re-Exposing LDR Videos with Video Diffusion Models
cs.CV 2026-04 unverdicted novelty 6.0

DiffHDR converts LDR videos to HDR by formulating the task as generative radiance inpainting in a video diffusion model's latent space, using Log-Gamma encoding and synthesized training data to achieve better fidelity...
Vanast: Virtual Try-On with Human Image Animation via Synthetic Triplet Supervision
cs.CV 2026-04 unverdicted novelty 6.0

Vanast produces coherent garment-transferred human animation videos from a single human image, garment images, and pose guidance video using synthetic triplet supervision and a Dual Module video diffusion transformer ...
From Ideal to Real: Stable Video Object Removal under Imperfect Conditions
cs.CV 2026-03 unverdicted novelty 6.0

SVOR achieves stable, shadow-free video object removal under real-world imperfections via MUSE mask handling, DA-Seg localization, and curriculum training on real and synthetic data.
Mamoda2.5: Enhancing Unified Multimodal Model with DiT-MoE
cs.CV 2026-05 unverdicted novelty 4.0

Mamoda2.5 is a 25B-parameter DiT-MoE unified AR-Diffusion model that reaches top video generation and editing benchmarks with 4-step inference up to 95.9x faster than baselines.
EchoTorrent: Towards Swift, Sustained, and Streaming Multi-Modal Video Generation
cs.CV 2026-02 unverdicted novelty 4.0

EchoTorrent combines multi-teacher distillation, adaptive CFG calibration, hybrid long-tail forcing, and VAE decoder refinement to enable few-pass autoregressive streaming video generation with improved temporal consi...
Evolution of Video Generative Foundations
cs.CV 2026-04 unverdicted novelty 2.0

This survey traces video generation technology from GANs to diffusion models and then to autoregressive and multimodal approaches while analyzing principles, strengths, and future trends.

Reference graph

Works this paper leans on

81 extracted references · 81 canonical work pages · cited by 18 Pith papers · 4 internal anchors

[1]

KLING AI, https://klingai.com/ ,

KLING AI. KLING AI, https://klingai.com/ ,

work page
[2]

Stable Diffusion v1.5 Model Card, https://huggingface.co/runwayml/stable- diffusion-v1-5, 2022

Runway AI. Stable Diffusion v1.5 Model Card, https://huggingface.co/runwayml/stable- diffusion-v1-5, 2022. 2

work page 2022
[3]

Stable Diffusion Inpainting Model Card, https://huggingface.co/runwayml/stable- diffusion-inpainting, 2022

Runway AI. Stable Diffusion Inpainting Model Card, https://huggingface.co/runwayml/stable- diffusion-inpainting, 2022. 2

work page 2022
[4]

Tim Brooks, Aleksander Holynski, and Alexei A. Efros. In- structPix2Pix: Learning To Follow Image Editing Instruc- tions. In IEEE Conf. Comput. Vis. Pattern Recog. , pages 18392–18402, 2023. 2, 3

work page 2023
[5]

OpenPose: Realtime Multi-Person 2D Pose Estimation Using Part Affinity Fields

Zhe Cao, Gines Hidalgo, Tomas Simon, Shih-En Wei, and Yaser Sheikh. OpenPose: Realtime Multi-Person 2D Pose Estimation Using Part Affinity Fields. IEEE Trans. Pattern Anal. Mach. Intell., 43(1):172–186, 2021. 5

work page 2021
[6]

Learning To Generate Line Drawings That Convey Geometry and Se- mantics

Caroline Chan, Fr ´edo Durand, and Phillip Isola. Learning To Generate Line Drawings That Convey Geometry and Se- mantics. In IEEE Conf. Comput. Vis. Pattern Recog., pages 7915–7925, 2022. 5

work page 2022
[7]

PixArt-$\alpha$: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis

Junsong Chen, Jincheng Yu, Chongjian Ge, Lewei Yao, Enze Xie, Yue Wu, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, and Zhenguo Li. PixArt- α: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Syn- thesis. arXiv preprint arXiv:2310.00426, 2023. 2

work page internal anchor Pith review Pith/arXiv arXiv 2023
[8]

Follow-Your-Canvas: Higher-Resolution Video Outpainting with Extensive Content Generation

Qihua Chen, Yue Ma, Hongfa Wang, Junkun Yuan, Wenzhe Zhao, Qi Tian, Hongmei Wang, Shaobo Min, Qifeng Chen, and Wei Liu. Follow-Your-Canvas: Higher-Resolution Video Outpainting with Extensive Content Generation. In Assoc. Adv. Artif. Intell., 2025. 6, 13

work page 2025
[9]

Goku: Flow Based Video Generative Foundation Models

Shoufa Chen, Chongjian Ge, Yuqi Zhang, Yida Zhang, Fengda Zhu, Hao Yang, Hongxiang Hao, Hui Wu, Zhichao Lai, Yifei Hu, Ting-Che Lin, Shilong Zhang, Fu Li, Chuan Li, Xing Wang, Yanghua Peng, Peize Sun, Ping Luo, Yi Jiang, Zehuan Yuan, Bingyue Peng, and Xiaobing Liu. Goku: Flow Based Video Generative Foundation Models. arXiv preprint arXiv:2502.04896, 2025. 2

work page arXiv 2025
[10]

Control- A-Video: Controllable Text-to-Video Diffusion Models with Motion Prior and Reward Feedback Learning.arXiv preprint arXiv:2305.13840, 2023

Weifeng Chen, Yatai Ji, Jie Wu, Hefeng Wu, Pan Xie, Ji- ashi Li, Xin Xia, Xuefeng Xiao, and Liang Lin. Control- A-Video: Controllable Text-to-Video Diffusion Models with Motion Prior and Reward Feedback Learning.arXiv preprint arXiv:2305.13840, 2023. 6, 13

work page arXiv 2023
[11]

AnyDoor: Zero-shot Object-level Image Customization

Xi Chen, Lianghua Huang, Yu Liu, Yujun Shen, Deli Zhao, and Hengshuang Zhao. AnyDoor: Zero-shot Object-level Image Customization. arXiv preprint arXiv:2307.09481 ,

work page arXiv
[12]

UniReal: Universal Image Generation and Editing via Learn- ing Real-world Dynamics

Xi Chen, Zhifei Zhang, He Zhang, Yuqian Zhou, Soo Ye Kim, Qing Liu, Yijun Li, Jianming Zhang, Nanxuan Zhao, Yilin Wang, Hui Ding, Zhe Lin, and Hengshuang Zhao. UniReal: Universal Image Generation and Editing via Learn- ing Real-world Dynamics. arXiv preprint arXiv:2412.07774,

work page arXiv
[13]

Tongyi Wanxiang, https://tongyi

Alibaba Cloud. Tongyi Wanxiang, https://tongyi. aliyun.com/wanxiang, 2023. 2

work page 2023
[14]

FLATTEN: Optical FLow-guided ATTENtion for consistent text-to-video edit- ing

Yuren Cong, Mengmeng Xu, Christian Simon, Shoufa Chen, Jiawei Ren, Yanping Xie, Juan-Manuel Perez-Rua, Bodo Rosenhahn, Tao Xiang, and Sen He. FLATTEN: Optical FLow-guided ATTENtion for consistent text-to-video edit- ing. In Int. Conf. Learn. Represent., 2024. 6, 7, 13

work page 2024
[15]

UNIC-Adapter: Unified Image-instruction Adapter with Multi-modal Transformer for Image Generation

Lunhao Duan, Shanshan Zhao, Wenjun Yan, Yinglun Li, Qing-Guo Chen, Zhao Xu, Weihua Luo, Kaifu Zhang, Ming- ming Gong, and Gui-Song Xia. UNIC-Adapter: Unified Image-instruction Adapter with Multi-modal Transformer for Image Generation. arXiv preprint arXiv:2412.18928 ,

work page arXiv
[16]

Scaling Rectified Flow Trans- formers for High-Resolution Image Synthesis

Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas M ¨uller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, Dustin Podell, Tim Dockhorn, Zion English, Kyle Lacey, Alex Goodwin, Yannik Marek, and Robin Rombach. Scaling Rectified Flow Trans- formers for High-Resolution Image Synthesis. In Int. Conf. Mach. Learn., 2024. 2

work page 2024
[17]

Hierar- chical Masked 3D Diffusion Model for Video Outpainting

Fanda Fan, Chaoxu Guo, Litong Gong, Biao Wang, Tiezheng Ge, Yuning Jiang, Chunjie Luo, and Jianfeng Zhan. Hierar- chical Masked 3D Diffusion Model for Video Outpainting. In ACM Int. Conf. Multimedia , pages 7890–7900, 2023. 6, 13

work page 2023
[18]

FLUX, https://blackforestlabs.ai/ ,

FLUX. FLUX, https://blackforestlabs.ai/ ,

work page
[19]

SEED-Data-Edit Technical Report: A Hybrid Dataset for In- structional Image Editing

Yuying Ge, Sijie Zhao, Chen Li, Yixiao Ge, and Ying Shan. SEED-Data-Edit Technical Report: A Hybrid Dataset for In- structional Image Editing. arXiv preprint arXiv:2405.04007,

work page arXiv
[20]

I2V- Adapter: A General Image-to-Video Adapter for Diffusion Models

Xun Guo, Mingwu Zheng, Liang Hou, Yuan Gao, Yufan Deng, Pengfei Wan, Di Zhang, Yufan Liu, Weiming Hu, Zhengjun Zha, Haibin Huang, and Chongyang Ma. I2V- Adapter: A General Image-to-Video Adapter for Diffusion Models. In ACM SIGGRAPH, pages 1–12, 2024. 2

work page 2024
[21]

PuLID: Pure and Lightning ID Cus- tomization via Contrastive Alignment

Zinan Guo, Yanze Wu, Zhuowei Chen, Lang Chen, Peng Zhang, and Qian He. PuLID: Pure and Lightning ID Cus- tomization via Contrastive Alignment. In Adv. Neural In- form. Process. Syst., 2024. 2

work page 2024
[22]

LTX-Video: Realtime Video Latent Diffusion

Yoav HaCohen, Nisan Chiprut, Benny Brazowski, Daniel Shalem, Dudu Moshe, Eitan Richardson, Eran Levin, Guy Shiran, Nir Zabari, Ori Gordon, Poriya Panet, Sapir Weiss- buch, Victor Kulikov, Yaki Bitterman, Zeev Melumian, and Ofir Bibi. LTX-Video: Realtime Video Latent Diffusion. arXiv preprint arXiv:2501.00103, 2025. 2, 6, 13

work page internal anchor Pith review Pith/arXiv arXiv 2025
[23]

ACE: All-round Creator and Editor Following Instructions via Dif- fusion Transformer

Zhen Han, Zeyinzi Jiang, Yulin Pan, Jingfeng Zhang, Chao- jie Mao, Chenwei Xie, Yu Liu, and Jingren Zhou. ACE: All-round Creator and Editor Following Instructions via Dif- fusion Transformer. In Int. Conf. Learn. Represent., 2025. 2, 3 9

work page 2025
[24]

Classifier-Free Diffusion Guidance

Jonathan Ho and Tim Salimans. Classifier-Free Diffusion Guidance. In Adv. Neural Inform. Process. Syst., 2021. 2

work page 2021
[25]

Denoising Dif- fusion Probabilistic Models

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising Dif- fusion Probabilistic Models. In Adv. Neural Inform. Process. Syst. Curran Associates, Inc., 2020. 2

work page 2020
[26]

Composer: Creative and Controllable Im- age Synthesis with Composable Conditions

Lianghua Huang, Di Chen, Yu Liu, Yujun Shen, Deli Zhao, and Jingren Zhou. Composer: Creative and Controllable Im- age Synthesis with Composable Conditions. In Int. Conf. Mach. Learn., 2023. 2

work page 2023
[27]

VBench: Com- prehensive Benchmark Suite for Video Generative Models

Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, Yaohui Wang, Xinyuan Chen, Limin Wang, Dahua Lin, Yu Qiao, and Ziwei Liu. VBench: Com- prehensive Benchmark Suite for Video Generative Models. In IEEE Conf. Comput. Vis. Pattern Recog. , pages 21807– 21818, 2024. 5, 7

work page 2024
[29]

Res-Tuning: A Flexible and Efficient Tuning Paradigm via Unbinding Tuner from Backbone

Zeyinzi Jiang, Chaojie Mao, Ziyuan Huang, Ao Ma, Yiliang Lv, Yujun Shen, Deli Zhao, and Jingren Zhou. Res-Tuning: A Flexible and Efficient Tuning Paradigm via Unbinding Tuner from Backbone. In Adv. Neural Inform. Process. Syst.,

work page
[30]

SCEdit: Efficient and Controllable Image Diffusion Generation via Skip Connection Editing

Zeyinzi Jiang, Chaojie Mao, Yulin Pan, Zhen Han, and Jingfeng Zhang. SCEdit: Efficient and Controllable Image Diffusion Generation via Skip Connection Editing. In IEEE Conf. Comput. Vis. Pattern Recog., pages 8995–9004, 2024. 2

work page 2024
[31]

Text2Video-Zero: Text- to-Image Diffusion Models are Zero-Shot Video Generators

Levon Khachatryan, Andranik Movsisyan, Vahram Tade- vosyan, Roberto Henschel, Zhangyang Wang, Shant Navasardyan, and Humphrey Shi. Text2Video-Zero: Text- to-Image Diffusion Models are Zero-Shot Video Generators. In Int. Conf. Comput. Vis., pages 15954–15964, 2023. 6, 13

work page 2023
[32]

HunyuanVideo: A Systematic Framework For Large Video Generative Models

Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, Kathrina Wu, Qin Lin, Junkun Yuan, Yanxin Long, Aladdin Wang, Andong Wang, Changlin Li, Duojun Huang, Fang Yang, Hao Tan, Hongmei Wang, Jacob Song, Jiawang Bai, Jianbing Wu, Jinbao Xue, Joey Wang, Kai Wang, Mengyang Liu, Pengyu Li, Shuai Li, ...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[33]

Hunyuan-dit: A powerful multi-resolution diffusion transformer with fine-grained chinese understanding,

Zhimin Li, Jianwei Zhang, Qin Lin, Jiangfeng Xiong, Yanxin Long, Xinchi Deng, Yingfang Zhang, Xingchao Liu, Minbin Huang, Zedong Xiao, et al. Hunyuan- DiT: A Powerful Multi-Resolution Diffusion Transformer with Fine-Grained Chinese Understanding. arXiv preprint arXiv:2405.08748, 2024. 2

work page arXiv 2024
[34]

MagicEdit: High-Fidelity and Temporally Coherent Video Editing

Jun Hao Liew, Hanshu Yan, Jianfeng Zhang, Zhongcong Xu, and Jiashi Feng. MagicEdit: High-Fidelity and Temporally Coherent Video Editing. arXiv preprint arXiv:2308.14749,

work page arXiv
[35]

Phantom: Subject- consistent video generation via cross-modal alignment

Lijie Liu, Tianxiang Ma, Bingchuan Li, Zhuowei Chen, Ji- awei Liu, Qian He, and Xinglong Wu. Phantom: Subject- consistent video generation via cross-modal alignment. arXiv preprint arXiv:2502.11079, 2025. 2, 3

work page arXiv 2025
[36]

Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection

Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, and Lei Zhang. Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection. arXiv preprint arXiv:2303.05499, 2023. 5

work page internal anchor Pith review Pith/arXiv arXiv 2023
[37]

Video-P2P: Video Editing with Cross-attention Con- trol

Shaoteng Liu, Yuechen Zhang, Wenbo Li, Zhe Lin, and Ji- aya Jia. Video-P2P: Video Editing with Cross-attention Con- trol. In IEEE Conf. Comput. Vis. Pattern Recog., pages 8599– 8608, 2024. 3

work page 2024
[38]

Cones: Concept Neurons in Diffusion Models for Cus- tomized Generation

Zhiheng Liu, Ruili Feng, Kai Zhu, Yifei Zhang, Kecheng Zheng, Yu Liu, Deli Zhao, Jingren Zhou, and Yang Cao. Cones: Concept Neurons in Diffusion Models for Cus- tomized Generation. In Int. Conf. Mach. Learn., 2023. 2

work page 2023
[39]

Cones 2: Customizable Image Synthesis with Multiple Subjects

Zhiheng Liu, Yifei Zhang, Yujun Shen, Kecheng Zheng, Kai Zhu, Ruili Feng, Yu Liu, Deli Zhao, Jingren Zhou, and Yang Cao. Cones 2: Customizable Image Synthesis with Multiple Subjects. In Adv. Neural Inform. Process. Syst., 2023. 2

work page 2023
[40]

Follow Your Pose: Pose-Guided Text-to-Video Generation using Pose- Free Videos

Yue Ma, Yingqing He, Xiaodong Cun, Xintao Wang, Siran Chen, Ying Shan, Xiu Li, and Qifeng Chen. Follow Your Pose: Pose-Guided Text-to-Video Generation using Pose- Free Videos. In Assoc. Adv. Artif. Intell., 2024. 6, 13

work page 2024
[41]

ACE++: Instruction- Based Image Creation and Editing via Context-Aware Con- tent Filling

Chaojie Mao, Jingfeng Zhang, Yulin Pan, Zeyinzi Jiang, Zhen Han, Yu Liu, and Jingren Zhou. ACE++: Instruction- Based Image Creation and Editing via Context-Aware Con- tent Filling. arXiv preprint arXiv:2501.02487, 2025. 2, 3

work page arXiv 2025
[42]

SDEdit: Guided Im- age Synthesis and Editing with Stochastic Differential Equa- tions

Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. SDEdit: Guided Im- age Synthesis and Editing with Stochastic Differential Equa- tions. In Int. Conf. Learn. Represent., 2021. 2

work page 2021
[43]

Midjourney, https://www.midjourney

Midjourney. Midjourney, https://www.midjourney. com, 2023. 2

work page 2023
[44]

Hailuo AI Video, https://hailuoai.com/ video, 2024

MiniMax. Hailuo AI Video, https://hailuoai.com/ video, 2024

work page 2024
[45]

DALL·E 3, https://openai.com/dall-e- 3, 2023

OpenAI. DALL·E 3, https://openai.com/dall-e- 3, 2023. 2

work page 2023
[46]

Drag Your GAN: Interactive Point-based Manipulation on the Genera- tive Image Manifold

Xingang Pan, Ayush Tewari, Thomas Leimk ¨uhler, Lingjie Liu, Abhimitra Meka, and Christian Theobalt. Drag Your GAN: Interactive Point-based Manipulation on the Genera- tive Image Manifold. In ACM SIGGRAPH, 2023. 2

work page 2023
[47]

Locate, Assign, Refine: Taming Cus- tomized Image Inpainting with Text-Subject Guidance

Yulin Pan, Chaojie Mao, Zeyinzi Jiang, Zhen Han, and Jingfeng Zhang. Locate, Assign, Refine: Taming Cus- tomized Image Inpainting with Text-Subject Guidance. arXiv preprint arXiv:2403.19534, 2024. 2

work page arXiv 2024
[48]

Scalable Diffusion Models with Transformers

William Peebles and Saining Xie. Scalable Diffusion Models with Transformers. In Int. Conf. Comput. Vis., pages 4195– 4305, 2023. 2

work page 2023
[49]

PiKa, https://pika.art/, 2025

PiKa. PiKa, https://pika.art/, 2025. 6, 7, 13

work page 2025
[50]

UniControl: A Unified Diffusion Model for Control- lable Visual Generation In the Wild

Can Qin, Shu Zhang, Ning Yu, Yihao Feng, Xinyi Yang, Yingbo Zhou, Huan Wang, Juan Carlos Niebles, Caiming 10 Xiong, Silvio Savarese, Stefano Ermon, Yun Fu, and Ran Xu. UniControl: A Unified Diffusion Model for Control- lable Visual Generation In the Wild. In Adv. Neural Inform. Process. Syst., 2023. 3

[51]

Towards Robust Monocular Depth Estimation: Mixing Datasets for Zero-Shot Cross-Dataset Transfer

René Ranftl, Katrin Lasinger, David Hafner, Konrad Schindler, and Vladlen Koltun. Towards Robust Monocular Depth Estimation: Mixing Datasets for Zero-Shot Cross-Dataset Transfer. IEEE Trans. Pattern Anal. Mach. Intell., pages 1623–1637, 2022. 5

[52]

SAM 2: Segment Anything in Images and Videos

Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, Eric Mintun, Junting Pan, Kalyan Vasudev Alwala, Nicolas Carion, Chao-Yuan Wu, Ross Girshick, Piotr Dollár, and Christoph Feichtenhofer. SAM 2: Segment Anything in Images and Videos. In Int. Conf. Learn. ...

work page 2025
[53]

High-resolution image synthesis with latent diffusion models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In IEEE Conf. Comput. Vis. Pattern Recog., pages 10684–10695, 2022. 2

[54]

U-Net: Convolutional Networks for Biomedical Image Segmentation

Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional Networks for Biomedical Image Segmentation. Med. Image Comput. Computer-Assisted Interv.,

[55]

Gen-3, https://app.runwayml.com/video-tools, 2025

Runway. Gen-3, https://app.runwayml.com/video-tools, 2025. 2

[56]

Denoising Diffusion Implicit Models

Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising Diffusion Implicit Models. In Int. Conf. Learn. Represent., 2021. 2

[57]

Score-Based Generative Modeling through Stochastic Differential Equations

Yang Song, Jascha Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-Based Generative Modeling through Stochastic Differential Equations. In Int. Conf. Learn. Represent., 2021. 2

[58]

Stable Diffusion v2-1 Model Card, https://huggingface.co/stabilityai/stable-diffusion-2-1, 2022

StabilityAI. Stable Diffusion v2-1 Model Card, https://huggingface.co/stabilityai/stable-diffusion-2-1, 2022. 2

[59]

Stable Diffusion XL Model Card, https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0, 2022

StabilityAI. Stable Diffusion XL Model Card, https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0, 2022. 2

[60]

CosXL Model Card, https://huggingface.co/stabilityai/cosxl, 2024

StabilityAI. CosXL Model Card, https://huggingface.co/stabilityai/cosxl, 2024. 3

[61]

ImageBrush: Learning Visual In-Context Instructions for Exemplar-Based Image Manipulation

Ya Sheng Sun, Yifan Yang, Houwen Peng, Yifei Shen, Yuqing Yang, Han Hu, Lili Qiu, and Hideki Koike. ImageBrush: Learning Visual In-Context Instructions for Exemplar-Based Image Manipulation. In Adv. Neural Inform. Process. Syst., 2023. 3

[62]

Resolution-Robust Large Mask Inpainting With Fourier Convolutions

Roman Suvorov, Elizaveta Logacheva, Anton Mashikhin, Anastasia Remizova, Arsenii Ashukha, Aleksei Silvestrov, Naejin Kong, Harshith Goka, Kiwoong Park, and Victor Lempitsky. Resolution-Robust Large Mask Inpainting With Fourier Convolutions. In IEEE Winter Conf. Appl. Comput. Vis., pages 2149–2159, 2022. 5

[63]

OminiControl: Minimal and Universal Control for Diffusion Transformer

Zhenxiong Tan, Songhua Liu, Xingyi Yang, Qiaochu Xue, and Xinchao Wang. OminiControl: Minimal and Universal Control for Diffusion Transformer. arXiv preprint arXiv:2411.15098, 2024. 2, 3

[64]

Wan: Open and advanced large-scale video generative models

Wan Team. Wan: Open and advanced large-scale video generative models. 2025. 2, 6, 13

[65]

RAFT: Recurrent All-Pairs Field Transforms for Optical Flow

Zachary Teed and Jia Deng. RAFT: Recurrent All-Pairs Field Transforms for Optical Flow. In Eur. Conf. Comput. Vis., pages 402–419, 2020. 5

[66]

Vidu, https://www.vidu.cn/, 2025

Vidu. Vidu, https://www.vidu.cn/, 2025. 6, 7, 13

[67]

InstantID: Zero-shot Identity-Preserving Generation in Seconds

Qixun Wang, Xu Bai, Haofan Wang, Zekui Qin, and Anthony Chen. InstantID: Zero-shot Identity-Preserving Generation in Seconds. arXiv preprint arXiv:2401.07519, 2024. 2

[68]

VideoComposer: Compositional Video Synthesis with Motion Controllability

Xiang Wang, Hangjie Yuan, Shiwei Zhang, Dayou Chen, Jiuniu Wang, Yingya Zhang, Yujun Shen, Deli Zhao, and Jingren Zhou. VideoComposer: Compositional Video Synthesis with Motion Controllability. In Adv. Neural Inform. Process. Syst., 2023. 2, 6, 13

[69]

MotionCtrl: A Unified and Flexible Motion Controller for Video Generation

Zhouxia Wang, Ziyang Yuan, Xintao Wang, Tianshui Chen, Menghan Xia, Ping Luo, and Ying Shan. MotionCtrl: A Unified and Flexible Motion Controller for Video Generation. In ACM SIGGRAPH, pages 1–11, 2024. 3

[70]

DreamVideo-2: Zero-Shot Subject-Driven Video Customization with Precise Motion Control

Yujie Wei, Shiwei Zhang, Hangjie Yuan, Xiang Wang, Haonan Qiu, Rui Zhao, Yutong Feng, Feng Liu, Zhizhong Huang, Jiaxin Ye, Yingya Zhang, and Hongming Shan. DreamVideo-2: Zero-Shot Subject-Driven Video Customization with Precise Motion Control. arXiv preprint arXiv:2410.13830, 2024. 2

[71]

OmniGen: Unified Image Generation

Shitao Xiao, Yueze Wang, Junjie Zhou, Huaying Yuan, Xingrun Xing, Ruiran Yan, Chaofan Li, Shuting Wang, Tiejun Huang, and Zheng Liu. OmniGen: Unified Image Generation. arXiv preprint arXiv:2409.11340, 2024. 2, 3

[72]

Effective Whole-body Pose Estimation with Two-stages Distillation

Zhendong Yang, Ailing Zeng, Chun Yuan, and Yu Li. Effective Whole-body Pose Estimation with Two-stages Distillation. In Int. Conf. Comput. Vis., pages 4210–4220, 2023. 5

[73]

CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, Da Yin, Xiaotao Gu, Yuxuan Zhang, Weihan Wang, Yean Cheng, Ting Liu, Bin Xu, Yuxiao Dong, and Jie Tang. CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer. In Int. Conf. Learn. Represent.,

[74]

Identity-Preserving Text-to-Video Generation by Frequency Decomposition

Shenghai Yuan, Jinfa Huang, Xianyi He, Yunyuan Ge, Yujun Shi, Liuhan Chen, Jiebo Luo, and Li Yuan. Identity-Preserving Text-to-Video Generation by Frequency Decomposition. In IEEE Conf. Comput. Vis. Pattern Recog., 2025. 2

[75]

MagicBrush: A Manually Annotated Dataset for Instruction-Guided Image Editing

Kai Zhang, Lingbo Mo, Wenhu Chen, Huan Sun, and Yu Su. MagicBrush: A Manually Annotated Dataset for Instruction-Guided Image Editing. In Adv. Neural Inform. Process. Syst.,

[76]

Adding Conditional Control to Text-to-Image Diffusion Models

Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding Conditional Control to Text-to-Image Diffusion Models. In Int. Conf. Comput. Vis., pages 3836–3847, 2023. 2

[77]

I2VGen-XL: High-Quality Image-to-Video Synthesis via Cascaded Diffusion Models

Shiwei Zhang, Jiayu Wang, Yingya Zhang, Kang Zhao, Hangjie Yuan, Zhiwu Qin, Xiang Wang, Deli Zhao, and Jingren Zhou. I2VGen-XL: High-Quality Image-to-Video Synthesis via Cascaded Diffusion Models. arXiv preprint arXiv:2311.04145, 2023. 2, 6, 13

[78]

Recognize Anything: A Strong Image Tagging Model

Youcai Zhang, Xinyu Huang, Jinyu Ma, Zhaoyang Li, Zhaochuan Luo, Yanchun Xie, Yuzhuo Qin, Tong Luo, Yaqian Li, Shilong Liu, Yandong Guo, and Lei Zhang. Recognize Anything: A Strong Image Tagging Model. arXiv preprint arXiv:2306.03514, 2023. 5

[79]

ControlVideo: Training-free Controllable Text-to-Video Generation

Yabo Zhang, Yuxiang Wei, Dongsheng Jiang, Xiaopeng Zhang, Wangmeng Zuo, and Qi Tian. ControlVideo: Training-free Controllable Text-to-Video Generation. In Int. Conf. Learn. Represent., 2024. 6, 13

[80]

Magic Mirror: ID-Preserved Video Generation in Video Diffusion Transformers

Yuechen Zhang, Yaoyang Liu, Bin Xia, Bohao Peng, Zexin Yan, Eric Lo, and Jiaya Jia. Magic Mirror: ID-Preserved Video Generation in Video Diffusion Transformers. arXiv preprint arXiv:2411.13503, 2025. 3

[81]

UltraEdit: Instruction-based Fine-Grained Image Editing at Scale

Haozhe Zhao, Xiaojian Ma, Liang Chen, Shuzheng Si, Rujie Wu, Kaikai An, Peiyu Yu, Minjia Zhang, Qing Li, and Baobao Chang. UltraEdit: Instruction-based Fine-Grained Image Editing at Scale. arXiv preprint arXiv:2407.05282v1,

