Recognition: 2 theorem links
· Lean TheoremVACE: All-in-One Video Creation and Editing
Pith reviewed 2026-05-16 00:48 UTC · model grok-4.3
The pith
VACE unifies reference-to-video generation, video-to-video editing, and masked editing in one diffusion transformer model.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
VACE enables unified video creation and editing by organizing task inputs such as editing, reference, and masking into a Video Condition Unit and injecting different task concepts through a Context Adapter that uses formalized temporal and spatial representations, allowing a single diffusion transformer to handle arbitrary video synthesis tasks with performance on par with task-specific models and support for versatile task combinations.
What carries the argument
The Video Condition Unit that unifies editing, reference, and masking inputs together with the Context Adapter structure that injects task-specific concepts into the model along temporal and spatial dimensions.
If this is right
- Users can execute reference-to-video generation, video-to-video editing, and masked editing through one model instead of switching systems.
- Task combinations become possible that were previously blocked by the need for separate models.
- Performance across subtasks stays comparable to dedicated models rather than falling behind.
- The pipeline for video content creation simplifies because a single trained network covers multiple use cases.
Where Pith is reading between the lines
- Deployment could become cheaper by maintaining only one large model instead of several task-specific ones.
- The same unification pattern might extend to additional tasks such as text-guided video or audio-conditioned editing.
- Real-time or interactive video applications could benefit from reduced switching overhead between models.
- Generalization across rare or novel task mixes might improve because the shared backbone learns richer video representations.
Load-bearing premise
The Video Condition Unit and Context Adapter can integrate the requirements of reference-to-video, video-to-video, and masked editing tasks into one model without causing performance to drop below that of specialized systems.
What would settle it
A side-by-side benchmark on a standard video dataset where VACE scores more than 5 percent worse than a dedicated reference-to-video model on metrics such as FVD or human preference ratings.
read the original abstract
Diffusion Transformer has demonstrated powerful capability and scalability in generating high-quality images and videos. Further pursuing the unification of generation and editing tasks has yielded significant progress in the domain of image content creation. However, due to the intrinsic demands for consistency across both temporal and spatial dynamics, achieving a unified approach for video synthesis remains challenging. We introduce VACE, which enables users to perform Video tasks within an All-in-one framework for Creation and Editing. These tasks include reference-to-video generation, video-to-video editing, and masked video-to-video editing. Specifically, we effectively integrate the requirements of various tasks by organizing video task inputs, such as editing, reference, and masking, into a unified interface referred to as the Video Condition Unit (VCU). Furthermore, by utilizing a Context Adapter structure, we inject different task concepts into the model using formalized representations of temporal and spatial dimensions, allowing it to handle arbitrary video synthesis tasks flexibly. Extensive experiments demonstrate that the unified model of VACE achieves performance on par with task-specific models across various subtasks. Simultaneously, it enables diverse applications through versatile task combinations. Project page: https://ali-vilab.github.io/VACE-Page/.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces VACE, a unified Diffusion Transformer framework for video creation and editing. It organizes inputs for reference-to-video generation, video-to-video editing, and masked video-to-video editing via a Video Condition Unit (VCU) and injects task concepts through a Context Adapter using formalized temporal and spatial representations. The central claim is that this single model achieves performance on par with task-specific models across subtasks while supporting versatile task combinations.
Significance. If the parity result holds with rigorous evidence, the work would advance unified video synthesis by demonstrating that a single architecture can handle multiple video tasks without degradation, enabling flexible combinations and reducing reliance on specialized models. This addresses the challenge of temporal-spatial consistency in video generation and could influence scalable all-in-one systems in the field.
major comments (2)
- [Experiments] Experiments section: the claim that VACE achieves on-par performance with task-specific models lacks reported quantitative metrics (e.g., FVD, FID, CLIP scores), baseline comparisons, ablation studies, or error analysis. Without these, the unification claim cannot be verified and the weakest assumption—that VCU and Context Adapter avoid negative transfer—remains untested.
- [Method] Method section describing VCU and Context Adapter: the integration of reference, editing, and masking signals into a unified interface is load-bearing for the no-degradation result, yet no controls or ablations isolate cross-task interference (e.g., potential collisions between mask tokens and reference features in high-motion regions).
minor comments (2)
- [Abstract] Abstract: the phrase 'extensive experiments demonstrate' should be supported by explicit mention of datasets, evaluation protocols, and at least one key quantitative result to strengthen the summary.
- [Method] Notation: the formalization of temporal and spatial dimensions in the Context Adapter would benefit from an explicit equation or diagram showing how task concepts are injected without overlap.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our work. The comments highlight important areas for improvement in presenting the experimental validation and methodological robustness. We will revise the manuscript to address these points as detailed below.
read point-by-point responses
-
Referee: [Experiments] Experiments section: the claim that VACE achieves on-par performance with task-specific models lacks reported quantitative metrics (e.g., FVD, FID, CLIP scores), baseline comparisons, ablation studies, or error analysis. Without these, the unification claim cannot be verified and the weakest assumption—that VCU and Context Adapter avoid negative transfer—remains untested.
Authors: We acknowledge the need for more rigorous quantitative support. The original manuscript includes some qualitative results and limited metrics, but to fully verify the on-par performance claim, we will add in the revision: comprehensive tables with FVD, FID, CLIP scores across all tasks, comparisons to multiple baselines, ablation studies on VCU and Context Adapter components, and error analysis for cases where performance differs. This will directly test and support the no negative transfer assumption. revision: yes
-
Referee: [Method] Method section describing VCU and Context Adapter: the integration of reference, editing, and masking signals into a unified interface is load-bearing for the no-degradation result, yet no controls or ablations isolate cross-task interference (e.g., potential collisions between mask tokens and reference features in high-motion regions).
Authors: We agree that isolating potential interference is crucial. In the revised manuscript, we will include additional ablation studies that separately evaluate the impact of each input type in the VCU and their combinations. Specifically, we will report results on high-motion video segments to check for any collisions or degradations between mask and reference signals. These controls will strengthen the evidence for the unified interface's effectiveness. revision: yes
Circularity Check
No circularity detected in derivation chain
full rationale
The paper presents VACE as a novel architecture that organizes inputs via the Video Condition Unit and injects concepts through the Context Adapter, with the central claim of parity to task-specific models resting on empirical validation from extensive experiments rather than any self-referential derivation, fitted parameters renamed as predictions, or load-bearing self-citations. No equations or steps in the provided abstract reduce by construction to the inputs; the unification is introduced as a design choice whose effectiveness is asserted to be shown externally, making the derivation self-contained against benchmarks.
Axiom & Free-Parameter Ledger
invented entities (2)
-
Video Condition Unit (VCU)
no independent evidence
-
Context Adapter
no independent evidence
Forward citations
Cited by 18 Pith papers
-
ActivityForensics: A Comprehensive Benchmark for Localizing Manipulated Activity in Videos
ActivityForensics is the first large-scale benchmark for temporally localizing activity-level forgeries in videos, paired with a diffusion-based baseline called TADiff.
-
MiVE: Multiscale Vision-language features for reference-guided video Editing
MiVE repurposes VLMs as multiscale feature extractors integrated into a unified self-attention Diffusion Transformer, achieving top human preference in reference-guided video editing.
-
DreamStereo: Towards Real-Time Stereo Inpainting for HD Videos
DreamStereo uses GAPW, PBDP, and SASI to enable real-time stereo video inpainting at 25 FPS for HD videos by reducing over 70% redundant computation while maintaining quality.
-
Controllable Generative Video Compression
CGVC uses coded keyframes and per-frame priors to guide controllable generative reconstruction of video frames, outperforming prior perceptual compression methods in both signal fidelity and perceptual quality.
-
Physics-Aware Video Instance Removal Benchmark
The PVIR benchmark tests video object removal on physical consistency using 95 annotated videos and shows that existing methods struggle with complex interactions like lingering shadows.
-
FaithfulFaces: Pose-Faithful Facial Identity Preservation for Text-to-Video Generation
FaithfulFaces introduces a pose-faithful identity aligner with a shared dictionary and invariance constraint to maintain facial identity in text-to-video generation under large pose changes and occlusions.
-
Bridging the Embodiment Gap: Disentangled Cross-Embodiment Video Editing
A dual-contrastive disentanglement method factorizes videos into independent task and embodiment latents, then uses a parameter-efficient adapter on a frozen video diffusion model to synthesize robot executions from s...
-
VEFX-Bench: A Holistic Benchmark for Generic Video Editing and Visual Effects
VEFX-Bench releases a large human-labeled video editing dataset, a multi-dimensional reward model, and a standardized benchmark that better matches human judgments than generic evaluators.
-
VibeFlow: Versatile Video Chroma-Lux Editing through Self-Supervised Learning
VibeFlow performs versatile video chroma-lux editing in zero-shot fashion by self-supervised disentanglement of structure and color-illumination cues inside pre-trained video models, plus residual velocity fields and ...
-
OmniShow: Unifying Multimodal Conditions for Human-Object Interaction Video Generation
OmniShow unifies text, image, audio, and pose conditions into an end-to-end model for high-quality human-object interaction video generation and introduces the HOIVG-Bench benchmark, claiming state-of-the-art results.
-
InsEdit: Towards Instruction-based Visual Editing via Data-Efficient Video Diffusion Models Adaptation
InsEdit adapts a video diffusion backbone for text-instruction video editing via Mutual Context Attention, achieving SOTA open-source results with O(100K) data while also supporting image editing.
-
Lighting-grounded Video Generation with Renderer-based Agent Reasoning
LiVER conditions video diffusion models on renderer-derived 3D control signals for disentangled, editable control over object layout, lighting, and camera trajectory.
-
DiffHDR: Re-Exposing LDR Videos with Video Diffusion Models
DiffHDR converts LDR videos to HDR by formulating the task as generative radiance inpainting in a video diffusion model's latent space, using Log-Gamma encoding and synthesized training data to achieve better fidelity...
-
Vanast: Virtual Try-On with Human Image Animation via Synthetic Triplet Supervision
Vanast produces coherent garment-transferred human animation videos from a single human image, garment images, and pose guidance video using synthetic triplet supervision and a Dual Module video diffusion transformer ...
-
From Ideal to Real: Stable Video Object Removal under Imperfect Conditions
SVOR achieves stable, shadow-free video object removal under real-world imperfections via MUSE mask handling, DA-Seg localization, and curriculum training on real and synthetic data.
-
Mamoda2.5: Enhancing Unified Multimodal Model with DiT-MoE
Mamoda2.5 is a 25B-parameter DiT-MoE unified AR-Diffusion model that reaches top video generation and editing benchmarks with 4-step inference up to 95.9x faster than baselines.
-
EchoTorrent: Towards Swift, Sustained, and Streaming Multi-Modal Video Generation
EchoTorrent combines multi-teacher distillation, adaptive CFG calibration, hybrid long-tail forcing, and VAE decoder refinement to enable few-pass autoregressive streaming video generation with improved temporal consi...
-
Evolution of Video Generative Foundations
This survey traces video generation technology from GANs to diffusion models and then to autoregressive and multimodal approaches while analyzing principles, strengths, and future trends.
Reference graph
Works this paper leans on
- [1]
-
[2]
Stable Diffusion v1.5 Model Card, https://huggingface.co/runwayml/stable- diffusion-v1-5, 2022
Runway AI. Stable Diffusion v1.5 Model Card, https://huggingface.co/runwayml/stable- diffusion-v1-5, 2022. 2
work page 2022
-
[3]
Runway AI. Stable Diffusion Inpainting Model Card, https://huggingface.co/runwayml/stable- diffusion-inpainting, 2022. 2
work page 2022
-
[4]
Tim Brooks, Aleksander Holynski, and Alexei A. Efros. In- structPix2Pix: Learning To Follow Image Editing Instruc- tions. In IEEE Conf. Comput. Vis. Pattern Recog. , pages 18392–18402, 2023. 2, 3
work page 2023
-
[5]
OpenPose: Realtime Multi-Person 2D Pose Estimation Using Part Affinity Fields
Zhe Cao, Gines Hidalgo, Tomas Simon, Shih-En Wei, and Yaser Sheikh. OpenPose: Realtime Multi-Person 2D Pose Estimation Using Part Affinity Fields. IEEE Trans. Pattern Anal. Mach. Intell., 43(1):172–186, 2021. 5
work page 2021
-
[6]
Learning To Generate Line Drawings That Convey Geometry and Se- mantics
Caroline Chan, Fr ´edo Durand, and Phillip Isola. Learning To Generate Line Drawings That Convey Geometry and Se- mantics. In IEEE Conf. Comput. Vis. Pattern Recog., pages 7915–7925, 2022. 5
work page 2022
-
[7]
PixArt-$\alpha$: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis
Junsong Chen, Jincheng Yu, Chongjian Ge, Lewei Yao, Enze Xie, Yue Wu, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, and Zhenguo Li. PixArt- α: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Syn- thesis. arXiv preprint arXiv:2310.00426, 2023. 2
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[8]
Follow-Your-Canvas: Higher-Resolution Video Outpainting with Extensive Content Generation
Qihua Chen, Yue Ma, Hongfa Wang, Junkun Yuan, Wenzhe Zhao, Qi Tian, Hongmei Wang, Shaobo Min, Qifeng Chen, and Wei Liu. Follow-Your-Canvas: Higher-Resolution Video Outpainting with Extensive Content Generation. In Assoc. Adv. Artif. Intell., 2025. 6, 13
work page 2025
-
[9]
Goku: Flow Based Video Generative Foundation Models
Shoufa Chen, Chongjian Ge, Yuqi Zhang, Yida Zhang, Fengda Zhu, Hao Yang, Hongxiang Hao, Hui Wu, Zhichao Lai, Yifei Hu, Ting-Che Lin, Shilong Zhang, Fu Li, Chuan Li, Xing Wang, Yanghua Peng, Peize Sun, Ping Luo, Yi Jiang, Zehuan Yuan, Bingyue Peng, and Xiaobing Liu. Goku: Flow Based Video Generative Foundation Models. arXiv preprint arXiv:2502.04896, 2025. 2
-
[10]
Weifeng Chen, Yatai Ji, Jie Wu, Hefeng Wu, Pan Xie, Ji- ashi Li, Xin Xia, Xuefeng Xiao, and Liang Lin. Control- A-Video: Controllable Text-to-Video Diffusion Models with Motion Prior and Reward Feedback Learning.arXiv preprint arXiv:2305.13840, 2023. 6, 13
-
[11]
AnyDoor: Zero-shot Object-level Image Customization
Xi Chen, Lianghua Huang, Yu Liu, Yujun Shen, Deli Zhao, and Hengshuang Zhao. AnyDoor: Zero-shot Object-level Image Customization. arXiv preprint arXiv:2307.09481 ,
-
[12]
UniReal: Universal Image Generation and Editing via Learn- ing Real-world Dynamics
Xi Chen, Zhifei Zhang, He Zhang, Yuqian Zhou, Soo Ye Kim, Qing Liu, Yijun Li, Jianming Zhang, Nanxuan Zhao, Yilin Wang, Hui Ding, Zhe Lin, and Hengshuang Zhao. UniReal: Universal Image Generation and Editing via Learn- ing Real-world Dynamics. arXiv preprint arXiv:2412.07774,
-
[13]
Tongyi Wanxiang, https://tongyi
Alibaba Cloud. Tongyi Wanxiang, https://tongyi. aliyun.com/wanxiang, 2023. 2
work page 2023
-
[14]
FLATTEN: Optical FLow-guided ATTENtion for consistent text-to-video edit- ing
Yuren Cong, Mengmeng Xu, Christian Simon, Shoufa Chen, Jiawei Ren, Yanping Xie, Juan-Manuel Perez-Rua, Bodo Rosenhahn, Tao Xiang, and Sen He. FLATTEN: Optical FLow-guided ATTENtion for consistent text-to-video edit- ing. In Int. Conf. Learn. Represent., 2024. 6, 7, 13
work page 2024
-
[15]
UNIC-Adapter: Unified Image-instruction Adapter with Multi-modal Transformer for Image Generation
Lunhao Duan, Shanshan Zhao, Wenjun Yan, Yinglun Li, Qing-Guo Chen, Zhao Xu, Weihua Luo, Kaifu Zhang, Ming- ming Gong, and Gui-Song Xia. UNIC-Adapter: Unified Image-instruction Adapter with Multi-modal Transformer for Image Generation. arXiv preprint arXiv:2412.18928 ,
-
[16]
Scaling Rectified Flow Trans- formers for High-Resolution Image Synthesis
Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas M ¨uller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, Dustin Podell, Tim Dockhorn, Zion English, Kyle Lacey, Alex Goodwin, Yannik Marek, and Robin Rombach. Scaling Rectified Flow Trans- formers for High-Resolution Image Synthesis. In Int. Conf. Mach. Learn., 2024. 2
work page 2024
-
[17]
Hierar- chical Masked 3D Diffusion Model for Video Outpainting
Fanda Fan, Chaoxu Guo, Litong Gong, Biao Wang, Tiezheng Ge, Yuning Jiang, Chunjie Luo, and Jianfeng Zhan. Hierar- chical Masked 3D Diffusion Model for Video Outpainting. In ACM Int. Conf. Multimedia , pages 7890–7900, 2023. 6, 13
work page 2023
- [18]
-
[19]
SEED-Data-Edit Technical Report: A Hybrid Dataset for In- structional Image Editing
Yuying Ge, Sijie Zhao, Chen Li, Yixiao Ge, and Ying Shan. SEED-Data-Edit Technical Report: A Hybrid Dataset for In- structional Image Editing. arXiv preprint arXiv:2405.04007,
-
[20]
I2V- Adapter: A General Image-to-Video Adapter for Diffusion Models
Xun Guo, Mingwu Zheng, Liang Hou, Yuan Gao, Yufan Deng, Pengfei Wan, Di Zhang, Yufan Liu, Weiming Hu, Zhengjun Zha, Haibin Huang, and Chongyang Ma. I2V- Adapter: A General Image-to-Video Adapter for Diffusion Models. In ACM SIGGRAPH, pages 1–12, 2024. 2
work page 2024
-
[21]
PuLID: Pure and Lightning ID Cus- tomization via Contrastive Alignment
Zinan Guo, Yanze Wu, Zhuowei Chen, Lang Chen, Peng Zhang, and Qian He. PuLID: Pure and Lightning ID Cus- tomization via Contrastive Alignment. In Adv. Neural In- form. Process. Syst., 2024. 2
work page 2024
-
[22]
LTX-Video: Realtime Video Latent Diffusion
Yoav HaCohen, Nisan Chiprut, Benny Brazowski, Daniel Shalem, Dudu Moshe, Eitan Richardson, Eran Levin, Guy Shiran, Nir Zabari, Ori Gordon, Poriya Panet, Sapir Weiss- buch, Victor Kulikov, Yaki Bitterman, Zeev Melumian, and Ofir Bibi. LTX-Video: Realtime Video Latent Diffusion. arXiv preprint arXiv:2501.00103, 2025. 2, 6, 13
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[23]
ACE: All-round Creator and Editor Following Instructions via Dif- fusion Transformer
Zhen Han, Zeyinzi Jiang, Yulin Pan, Jingfeng Zhang, Chao- jie Mao, Chenwei Xie, Yu Liu, and Jingren Zhou. ACE: All-round Creator and Editor Following Instructions via Dif- fusion Transformer. In Int. Conf. Learn. Represent., 2025. 2, 3 9
work page 2025
-
[24]
Classifier-Free Diffusion Guidance
Jonathan Ho and Tim Salimans. Classifier-Free Diffusion Guidance. In Adv. Neural Inform. Process. Syst., 2021. 2
work page 2021
-
[25]
Denoising Dif- fusion Probabilistic Models
Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising Dif- fusion Probabilistic Models. In Adv. Neural Inform. Process. Syst. Curran Associates, Inc., 2020. 2
work page 2020
-
[26]
Composer: Creative and Controllable Im- age Synthesis with Composable Conditions
Lianghua Huang, Di Chen, Yu Liu, Yujun Shen, Deli Zhao, and Jingren Zhou. Composer: Creative and Controllable Im- age Synthesis with Composable Conditions. In Int. Conf. Mach. Learn., 2023. 2
work page 2023
-
[27]
VBench: Com- prehensive Benchmark Suite for Video Generative Models
Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, Yaohui Wang, Xinyuan Chen, Limin Wang, Dahua Lin, Yu Qiao, and Ziwei Liu. VBench: Com- prehensive Benchmark Suite for Video Generative Models. In IEEE Conf. Comput. Vis. Pattern Recog. , pages 21807– 21818, 2024. 5, 7
work page 2024
-
[29]
Res-Tuning: A Flexible and Efficient Tuning Paradigm via Unbinding Tuner from Backbone
Zeyinzi Jiang, Chaojie Mao, Ziyuan Huang, Ao Ma, Yiliang Lv, Yujun Shen, Deli Zhao, and Jingren Zhou. Res-Tuning: A Flexible and Efficient Tuning Paradigm via Unbinding Tuner from Backbone. In Adv. Neural Inform. Process. Syst.,
-
[30]
SCEdit: Efficient and Controllable Image Diffusion Generation via Skip Connection Editing
Zeyinzi Jiang, Chaojie Mao, Yulin Pan, Zhen Han, and Jingfeng Zhang. SCEdit: Efficient and Controllable Image Diffusion Generation via Skip Connection Editing. In IEEE Conf. Comput. Vis. Pattern Recog., pages 8995–9004, 2024. 2
work page 2024
-
[31]
Text2Video-Zero: Text- to-Image Diffusion Models are Zero-Shot Video Generators
Levon Khachatryan, Andranik Movsisyan, Vahram Tade- vosyan, Roberto Henschel, Zhangyang Wang, Shant Navasardyan, and Humphrey Shi. Text2Video-Zero: Text- to-Image Diffusion Models are Zero-Shot Video Generators. In Int. Conf. Comput. Vis., pages 15954–15964, 2023. 6, 13
work page 2023
-
[32]
HunyuanVideo: A Systematic Framework For Large Video Generative Models
Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, Kathrina Wu, Qin Lin, Junkun Yuan, Yanxin Long, Aladdin Wang, Andong Wang, Changlin Li, Duojun Huang, Fang Yang, Hao Tan, Hongmei Wang, Jacob Song, Jiawang Bai, Jianbing Wu, Jinbao Xue, Joey Wang, Kai Wang, Mengyang Liu, Pengyu Li, Shuai Li, ...
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[33]
Zhimin Li, Jianwei Zhang, Qin Lin, Jiangfeng Xiong, Yanxin Long, Xinchi Deng, Yingfang Zhang, Xingchao Liu, Minbin Huang, Zedong Xiao, et al. Hunyuan- DiT: A Powerful Multi-Resolution Diffusion Transformer with Fine-Grained Chinese Understanding. arXiv preprint arXiv:2405.08748, 2024. 2
-
[34]
MagicEdit: High-Fidelity and Temporally Coherent Video Editing
Jun Hao Liew, Hanshu Yan, Jianfeng Zhang, Zhongcong Xu, and Jiashi Feng. MagicEdit: High-Fidelity and Temporally Coherent Video Editing. arXiv preprint arXiv:2308.14749,
-
[35]
Phantom: Subject- consistent video generation via cross-modal alignment
Lijie Liu, Tianxiang Ma, Bingchuan Li, Zhuowei Chen, Ji- awei Liu, Qian He, and Xinglong Wu. Phantom: Subject- consistent video generation via cross-modal alignment. arXiv preprint arXiv:2502.11079, 2025. 2, 3
-
[36]
Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection
Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, and Lei Zhang. Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection. arXiv preprint arXiv:2303.05499, 2023. 5
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[37]
Video-P2P: Video Editing with Cross-attention Con- trol
Shaoteng Liu, Yuechen Zhang, Wenbo Li, Zhe Lin, and Ji- aya Jia. Video-P2P: Video Editing with Cross-attention Con- trol. In IEEE Conf. Comput. Vis. Pattern Recog., pages 8599– 8608, 2024. 3
work page 2024
-
[38]
Cones: Concept Neurons in Diffusion Models for Cus- tomized Generation
Zhiheng Liu, Ruili Feng, Kai Zhu, Yifei Zhang, Kecheng Zheng, Yu Liu, Deli Zhao, Jingren Zhou, and Yang Cao. Cones: Concept Neurons in Diffusion Models for Cus- tomized Generation. In Int. Conf. Mach. Learn., 2023. 2
work page 2023
-
[39]
Cones 2: Customizable Image Synthesis with Multiple Subjects
Zhiheng Liu, Yifei Zhang, Yujun Shen, Kecheng Zheng, Kai Zhu, Ruili Feng, Yu Liu, Deli Zhao, Jingren Zhou, and Yang Cao. Cones 2: Customizable Image Synthesis with Multiple Subjects. In Adv. Neural Inform. Process. Syst., 2023. 2
work page 2023
-
[40]
Follow Your Pose: Pose-Guided Text-to-Video Generation using Pose- Free Videos
Yue Ma, Yingqing He, Xiaodong Cun, Xintao Wang, Siran Chen, Ying Shan, Xiu Li, and Qifeng Chen. Follow Your Pose: Pose-Guided Text-to-Video Generation using Pose- Free Videos. In Assoc. Adv. Artif. Intell., 2024. 6, 13
work page 2024
-
[41]
ACE++: Instruction- Based Image Creation and Editing via Context-Aware Con- tent Filling
Chaojie Mao, Jingfeng Zhang, Yulin Pan, Zeyinzi Jiang, Zhen Han, Yu Liu, and Jingren Zhou. ACE++: Instruction- Based Image Creation and Editing via Context-Aware Con- tent Filling. arXiv preprint arXiv:2501.02487, 2025. 2, 3
-
[42]
SDEdit: Guided Im- age Synthesis and Editing with Stochastic Differential Equa- tions
Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. SDEdit: Guided Im- age Synthesis and Editing with Stochastic Differential Equa- tions. In Int. Conf. Learn. Represent., 2021. 2
work page 2021
-
[43]
Midjourney, https://www.midjourney
Midjourney. Midjourney, https://www.midjourney. com, 2023. 2
work page 2023
-
[44]
Hailuo AI Video, https://hailuoai.com/ video, 2024
MiniMax. Hailuo AI Video, https://hailuoai.com/ video, 2024
work page 2024
-
[45]
DALL·E 3, https://openai.com/dall-e- 3, 2023
OpenAI. DALL·E 3, https://openai.com/dall-e- 3, 2023. 2
work page 2023
-
[46]
Drag Your GAN: Interactive Point-based Manipulation on the Genera- tive Image Manifold
Xingang Pan, Ayush Tewari, Thomas Leimk ¨uhler, Lingjie Liu, Abhimitra Meka, and Christian Theobalt. Drag Your GAN: Interactive Point-based Manipulation on the Genera- tive Image Manifold. In ACM SIGGRAPH, 2023. 2
work page 2023
-
[47]
Locate, Assign, Refine: Taming Cus- tomized Image Inpainting with Text-Subject Guidance
Yulin Pan, Chaojie Mao, Zeyinzi Jiang, Zhen Han, and Jingfeng Zhang. Locate, Assign, Refine: Taming Cus- tomized Image Inpainting with Text-Subject Guidance. arXiv preprint arXiv:2403.19534, 2024. 2
-
[48]
Scalable Diffusion Models with Transformers
William Peebles and Saining Xie. Scalable Diffusion Models with Transformers. In Int. Conf. Comput. Vis., pages 4195– 4305, 2023. 2
work page 2023
- [49]
-
[50]
UniControl: A Unified Diffusion Model for Control- lable Visual Generation In the Wild
Can Qin, Shu Zhang, Ning Yu, Yihao Feng, Xinyi Yang, Yingbo Zhou, Huan Wang, Juan Carlos Niebles, Caiming 10 Xiong, Silvio Savarese, Stefano Ermon, Yun Fu, and Ran Xu. UniControl: A Unified Diffusion Model for Control- lable Visual Generation In the Wild. In Adv. Neural Inform. Process. Syst., 2023. 3
work page 2023
-
[51]
Towards Robust Monocu- lar Depth Estimation: Mixing Datasets for Zero-Shot Cross- Dataset Transfer
Ren ´e Ranftl, Katrin Lasinger, David Hafner, Konrad Schindler, and Vladlen Koltun. Towards Robust Monocu- lar Depth Estimation: Mixing Datasets for Zero-Shot Cross- Dataset Transfer. IEEE Trans. Pattern Anal. Mach. Intell. , pages 1623–1637, 2022. 5
work page 2022
-
[52]
SAM 2: Segment Anything in Images and Videos
Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman R¨adle, Chloe Rolland, Laura Gustafson, Eric Mintun, Junt- ing Pan, Kalyan Vasudev Alwala, Nicolas Carion, Chao- Yuan Wu, Ross Girshick, Piotr Doll´ar, and Christoph Feicht- enhofer. SAM 2: Segment Anything in Images and Videos. In Int. Conf. Learn. ...
work page 2025
-
[53]
High-resolution image syn- thesis with latent diffusion models
Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bj¨orn Ommer. High-resolution image syn- thesis with latent diffusion models. In IEEE Conf. Comput. Vis. Pattern Recog., pages 10684–10695, 2022. 2
work page 2022
-
[54]
U- Net: Convolutional Networks for Biomedical Image Seg- mentation
Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U- Net: Convolutional Networks for Biomedical Image Seg- mentation. Med. Image Comput. Computer-Assisted Interv.,
-
[55]
Runway. Gen-3, https : / / app . runwayml . com / video-tools, 2025. 2
work page 2025
-
[56]
Denois- ing Diffusion Implicit Models
Jiaming Song, Chenlin Meng, and Stefano Ermon. Denois- ing Diffusion Implicit Models. In Int. Conf. Learn. Repre- sent., 2021. 2
work page 2021
-
[57]
Kingma, Ab- hishek Kumar, Stefano Ermon, and Ben Poole
Yang Song, Jascha Sohl-Dickstein, Diederik P. Kingma, Ab- hishek Kumar, Stefano Ermon, and Ben Poole. Score-Based Generative Modeling through Stochastic Differential Equa- tions. In Int. Conf. Learn. Represent., 2021. 2
work page 2021
-
[58]
Stable Diffusion v2-1 Model Card, https: / / huggingface
StabilityAI. Stable Diffusion v2-1 Model Card, https: / / huggingface . co / stabilityai / stable - diffusion-2-1, 2022. 2
work page 2022
-
[59]
Stable Diffusion XL Model Card, https: / / huggingface
StabilityAI. Stable Diffusion XL Model Card, https: / / huggingface . co / stabilityai / stable - diffusion-xl-base-1.0 , 2022. 2
work page 2022
-
[60]
CosXL Model Card, https : //huggingface.co/stabilityai/cosxl, 2024
StabilityAI. CosXL Model Card, https : //huggingface.co/stabilityai/cosxl, 2024. 3
work page 2024
-
[61]
Im- ageBrush: Learning Visual In-Context Instructions for Exemplar-Based Image Manipulation
Ya Sheng Sun, Yifan Yang, Houwen Peng, Yifei Shen, Yuqing Yang, Han Hu, Lili Qiu, and Hideki Koike. Im- ageBrush: Learning Visual In-Context Instructions for Exemplar-Based Image Manipulation. In Adv. Neural In- form. Process. Syst., 2023. 3
work page 2023
-
[62]
Resolution-Robust Large Mask Inpainting With Fourier Convolutions
Roman Suvorov, Elizaveta Logacheva, Anton Mashikhin, Anastasia Remizova, Arsenii Ashukha, Aleksei Silvestrov, Naejin Kong, Harshith Goka, Kiwoong Park, and Victor Lempitsky. Resolution-Robust Large Mask Inpainting With Fourier Convolutions. In IEEE Winter Conf. Appl. Comput. Vis., pages 2149–2159, 2022. 5
work page 2022
-
[63]
OminiControl: Minimal and Uni- versal Control for Diffusion Transformer
Zhenxiong Tan, Songhua Liu, Xingyi Yang, Qiaochu Xue, and Xinchao Wang. OminiControl: Minimal and Uni- versal Control for Diffusion Transformer. arXiv preprint arXiv:2411.15098, 2024. 2, 3
-
[64]
Wan: Open and advanced large-scale video gen- erative models
Wan Team. Wan: Open and advanced large-scale video gen- erative models. 2025. 2, 6, 13
work page 2025
-
[65]
RAFT: Recurrent All-Pairs Field Transforms for Optical Flow
Zachary Teed and Jia Deng. RAFT: Recurrent All-Pairs Field Transforms for Optical Flow. In Eur. Conf. Comput. Vis., pages 402–419, 2020. 5
work page 2020
- [66]
-
[67]
InstantID: Zero-shot Identity-Preserving Gen- eration in Seconds
Qixun Wang, Xu Bai, Haofan Wang, Zekui Qin, and An- thony Chen. InstantID: Zero-shot Identity-Preserving Gen- eration in Seconds. arXiv preprint arXiv:2401.07519, 2024. 2
-
[68]
VideoComposer: Compositional Video Synthesis with Motion Controllability
Xiang Wang, Hangjie Yuan, Shiwei Zhang, Dayou Chen, Ji- uniu Wang, Yingya Zhang, Yujun Shen, Deli Zhao, and Jin- gren Zhou. VideoComposer: Compositional Video Synthesis with Motion Controllability. In Adv. Neural Inform. Process. Syst., 2023. 2, 6, 13
work page 2023
-
[69]
MotionCtrl: A Uni- fied and Flexible Motion Controller for Video Generation
Zhouxia Wang, Ziyang Yuan, Xintao Wang, Tianshui Chen, Menghan Xia, Ping Luo, and Ying Shan. MotionCtrl: A Uni- fied and Flexible Motion Controller for Video Generation. In ACM SIGGRAPH, pages 1–11, 2024. 3
work page 2024
-
[70]
DreamVideo-2: Zero-Shot Subject-Driven Video Cus- tomization with Precise Motion Control
Yujie Wei, Shiwei Zhang, Hangjie Yuan, Xiang Wang, Hao- nan Qiu, Rui Zhao, Yutong Feng, Feng Liu, Zhizhong Huang, Jiaxin Ye, Yingya Zhang, and Hongming Shan. DreamVideo-2: Zero-Shot Subject-Driven Video Cus- tomization with Precise Motion Control. arXiv preprint arXiv:2410.13830, 2024. 2
-
[71]
OmniGen: Unified Image Genera- tion
Shitao Xiao, Yueze Wang, Junjie Zhou, Huaying Yuan, Xin- grun Xing, Ruiran Yan, Chaofan Li, Shuting Wang, Tiejun Huang, and Zheng Liu. OmniGen: Unified Image Genera- tion. arXiv preprint arXiv:2409.11340, 2024. 2, 3
-
[72]
Ef- fective Whole-body Pose Estimation with Two-stages Distil- lation
Zhendong Yang, Ailing Zeng, Chun Yuan, and Yu Li. Ef- fective Whole-body Pose Estimation with Two-stages Distil- lation. In Int. Conf. Comput. Vis., pages 4210–4220, 2023. 5
work page 2023
-
[73]
CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer
Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, Da Yin, Xiaotao Gu, Yuxuan Zhang, Weihan Wang, Yean Cheng, Ting Liu, Bin Xu, Yuxiao Dong, and Jie Tang. CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer. InInt. Conf. Learn. Represent.,
-
[74]
Identity- Preserving Text-to-Video Generation by Frequency Decom- position
Shenghai Yuan, Jinfa Huang, Xianyi He, Yunyuan Ge, Yu- jun Shi, Liuhan Chen, Jiebo Luo, and Li Yuan. Identity- Preserving Text-to-Video Generation by Frequency Decom- position. In IEEE Conf. Comput. Vis. Pattern Recog., 2025. 2
work page 2025
-
[75]
MagicBrush: A Manually Annotated Dataset for Instruction- Guided Image Editing
Kai Zhang, Lingbo Mo, Wenhu Chen, Huan Sun, and Yu Su. MagicBrush: A Manually Annotated Dataset for Instruction- Guided Image Editing. In Adv. Neural Inform. Process. Syst.,
-
[76]
Adding Conditional Control to Text-to-Image Diffusion Models
Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding Conditional Control to Text-to-Image Diffusion Models. In Int. Conf. Comput. Vis., pages 3836–3847, 2023. 2
work page 2023
-
[77]
arXiv preprint arXiv:2311.04145 (2023) 30
Shiwei Zhang, Jiayu Wang, Yingya Zhang, Kang Zhao, Hangjie Yuan, Zhiwu Qin, Xiang Wang, Deli Zhao, and Jingren Zhou. I2VGen-XL: High-Quality Image-to-Video Synthesis via Cascaded Diffusion Models. arXiv preprint arXiv:2311.04145, 2023. 2, 6, 13
-
[78]
Rec- ognize Anything: A Strong Image Tagging Model
Youcai Zhang, Xinyu Huang, Jinyu Ma, Zhaoyang Li, Zhaochuan Luo, Yanchun Xie, Yuzhuo Qin, Tong Luo, 11 Yaqian Li, Shilong Liu, Yandong Guo, and Lei Zhang. Rec- ognize Anything: A Strong Image Tagging Model. arXiv preprint arXiv:2306.03514, 2023. 5
-
[79]
ControlVideo: Training-free Controllable Text-to-Video Generation
Yabo Zhang, Yuxiang Wei, Dongsheng Jiang, Xiaopeng Zhang, Wangmeng Zuo, and Qi Tian. ControlVideo: Training-free Controllable Text-to-Video Generation. In Int. Conf. Learn. Represent., 2024. 6, 13
work page 2024
-
[80]
Magic Mirror: ID-Preserved Video Generation in Video Diffusion Transformers
Yuechen Zhang, Yaoyang Liu, Bin Xia, Bohao Peng, Zexin Yan, Eric Lo, and Jiaya Jia. Magic Mirror: ID-Preserved Video Generation in Video Diffusion Transformers. arXiv preprint arXiv:2411.13503, 2025. 3
-
[81]
UltraEdit: Instruction-based Fine-Grained Image Editing at Scale
Haozhe Zhao, Xiaojian Ma, Liang Chen, Shuzheng Si, Ru- jie Wu, Kaikai An, Peiyu Yu, Minjia Zhang, Qing Li, and Baobao Chang. UltraEdit: Instruction-based Fine-Grained Image Editing at Scale. arXiv preprint arXiv:2407.05282v1,
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.