Omni-Customizer: End-to-End MultiModal Customization for Joint Audio-Video Generation
Pith reviewed 2026-05-20 14:03 UTC · model grok-4.3
The pith
Omni-Customizer fuses multimodal identity cues into text prompts to achieve consistent visual identities and vocal timbres in joint audio-video generation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Omni-Customizer is an end-to-end framework for precise multimodal customization in joint audio-video generation. It uses an Omni-Context Fusion module to enrich textual prompts with dense multimodal identity cues, a Masked TTS Cross-Attention mechanism to prevent speech leakage, and Semantic-Anchored Multimodal RoPE to anchor visual, audio, and TTS reference tokens to corresponding semantic descriptions. A comprehensive training strategy with interleaved audio-video scheduling and progressive in-pair to cross-pair curriculum enables rapid multilingual adaptation and robust high-level identity feature learning.
What carries the argument
Omni-Context Fusion (OCF) module that enriches textual prompts with multimodal identity cues, paired with Semantic-Anchored Multimodal RoPE (SA-MRoPE) for anchoring reference tokens to semantic descriptions and Masked TTS Cross-Attention (MTP-CA) to prevent speech leakage.
If this is right
- Preserves visual identities and vocal timbres simultaneously across multiple interacting subjects in generated videos.
- Prevents severe speech leakage during the generation process.
- Supports rapid adaptation to multilingual audio scenarios without loss of base model performance.
- Delivers improved audio-video synchronization and overall fidelity in customized outputs.
Where Pith is reading between the lines
- The modular fusion approach could support extensions to additional input modalities such as depth maps or motion references for richer customization.
- Applications in media production might include creating personalized dubbed videos where actors' appearances remain fixed while voices change languages.
- The curriculum-based training pattern may generalize to other multimodal models that need to incorporate new reference types incrementally.
Load-bearing premise
The training strategy of interleaved audio-video scheduling and progressive curriculum can adapt the audio branch to multilingual scenarios and new identity combinations without degrading the base model's foundational priors.
What would settle it
Test the generated outputs on a held-out multilingual dataset or unseen subject combinations and check whether visual identity similarity scores or timbre consistency metrics fall below those of existing baseline methods.
Figures
read the original abstract
The landscape of joint audio and video generation has been fundamentally transformed by the advent of powerful foundation models. Despite these strides, achieving cohesive multimodal customization for the simultaneous preservation of visual identities and vocal timbres across multiple interacting subjects remains largely underexplored. To bridge this gap, we present Omni-Customizer, an end-to-end framework targeted at the precise binding and seamless fusion of multimodal identity information. Specifically, we introduce an Omni-Context Fusion (OCF) module that effectively enriches the base textual prompt with dense, multimodal identity cues, along with a Masked TTS Cross-Attention (MTP-CA) mechanism explicitly designed to prevent the severe "speech leakage" problem. Within this architecture, we propose Semantic-Anchored Multimodal RoPE (SA-MRoPE) to anchor visual and audio reference tokens, along with TTS embeddings, to their corresponding semantic descriptions, enabling structured multimodal fusion and robust identity binding. Furthermore, we devise a comprehensive training strategy that incorporates interleaved audio-video scheduling to rapidly adapt the audio branch to multilingual scenarios without degrading foundational priors, and a progressive in-pair to cross-pair curriculum to facilitate the learning of high-level and robust identity features. Extensive experiments demonstrate that Omni-Customizer achieves state-of-the-art performance in dual-modal customized generation, excelling across visual identity similarity, timbre consistency, precise audio-video synchronization, and overall video-audio fidelity.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents Omni-Customizer, an end-to-end framework for joint audio-video generation that enables precise multimodal customization while preserving visual identities and vocal timbres across multiple interacting subjects. It introduces an Omni-Context Fusion (OCF) module to enrich textual prompts with dense identity cues, a Masked TTS Cross-Attention (MTP-CA) mechanism to mitigate speech leakage, Semantic-Anchored Multimodal RoPE (SA-MRoPE) for structured token anchoring, and a training regimen combining interleaved audio-video scheduling with a progressive in-pair to cross-pair curriculum. The central claim, supported by experiments, is state-of-the-art performance on metrics including visual identity similarity, timbre consistency, audio-video synchronization, and overall fidelity.
Significance. If the reported quantitative results and ablations hold, the work makes a meaningful contribution to multimodal generative modeling by tackling the underexplored problem of simultaneous visual-audio identity binding in multi-subject settings. The architectural components and curriculum-based training strategy provide concrete, reproducible advances that could inform downstream applications in personalized video synthesis and dubbing. The inclusion of baseline comparisons and ablation studies isolating the interleaved scheduling strengthens the evidential basis.
major comments (2)
- [Experiments] Experiments section: the reported gains in identity similarity and synchronization are quantified against baselines, but the manuscript does not specify the exact composition of the multi-subject test set (number of subjects per video, interaction density) used for the cross-pair evaluation; this detail is load-bearing for assessing whether the SOTA claim generalizes beyond the training distribution.
- [§3.2] §3.2, description of MTP-CA: the masking strategy is presented as preventing speech leakage, yet no ablation isolates the contribution of the masking ratio versus the cross-attention design itself; without this, it is difficult to attribute the timbre consistency improvements specifically to MTP-CA rather than the overall training schedule.
minor comments (3)
- [Abstract] The abstract contains minor grammatical inconsistencies (e.g., 'MultiModal' capitalization and 'excelling across ... and overall video-audio fidelity').
- [Figure 3] Figure 3 (architecture diagram) would benefit from explicit arrows indicating the flow of TTS embeddings into SA-MRoPE to improve readability of the multimodal fusion path.
- [Related Work] A few references to prior multimodal RoPE variants are cited but lack direct comparison in the related-work section; adding one sentence contrasting SA-MRoPE with the closest prior work would aid context.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and recommendation for minor revision. We address each major comment below and will revise the manuscript to incorporate the requested clarifications and additional analysis.
read point-by-point responses
-
Referee: [Experiments] Experiments section: the reported gains in identity similarity and synchronization are quantified against baselines, but the manuscript does not specify the exact composition of the multi-subject test set (number of subjects per video, interaction density) used for the cross-pair evaluation; this detail is load-bearing for assessing whether the SOTA claim generalizes beyond the training distribution.
Authors: We agree that explicit details on test-set composition are necessary to evaluate generalization. We will expand Section 4.1 in the revised manuscript to specify the exact composition of the multi-subject test set used for cross-pair evaluation, including the number of subjects per video and quantitative measures of interaction density. revision: yes
-
Referee: [§3.2] §3.2, description of MTP-CA: the masking strategy is presented as preventing speech leakage, yet no ablation isolates the contribution of the masking ratio versus the cross-attention design itself; without this, it is difficult to attribute the timbre consistency improvements specifically to MTP-CA rather than the overall training schedule.
Authors: We acknowledge that an ablation isolating the masking ratio from the cross-attention design would strengthen attribution of the timbre-consistency gains. We will add this targeted ablation to Section 4.3 of the revised manuscript while retaining the existing module-level ablations. revision: yes
Circularity Check
No significant circularity detected
full rationale
The manuscript describes an architectural framework (OCF, MTP-CA, SA-MRoPE) and training regimen (interleaved scheduling, curriculum) for multimodal audio-video customization, with performance claims grounded in quantitative experiments and ablations against external baselines. No equations, derivations, fitted parameters, or first-principles predictions are presented that could reduce to self-definition or self-citation chains; the central results remain independent and falsifiable via reported metrics on identity similarity, timbre consistency, and synchronization.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Interleaved audio-video scheduling rapidly adapts the audio branch to multilingual scenarios without degrading foundational priors
- domain assumption Progressive in-pair to cross-pair curriculum facilitates the learning of high-level and robust identity features
invented entities (3)
-
Omni-Context Fusion (OCF) module
no independent evidence
-
Masked TTS Cross-Attention (MTP-CA) mechanism
no independent evidence
-
Semantic-Anchored Multimodal RoPE (SA-MRoPE)
no independent evidence
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
Omni-Context Fusion (OCF) module... Semantic-Anchored Multimodal RoPE (SA-MRoPE)... Masked TTS Cross-Attention (MTP-CA)... interleaved audio-video scheduling... progressive in-pair to cross-pair curriculum
-
IndisputableMonolith/Foundation/RealityFromDistinction.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
dual-stream Diffusion Transformer (DiT) architecture, initialized directly from the pre-trained Ovi backbone
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Hong Chen, Xin Wang, Guanning Zeng, Yipeng Zhang, Yuwei Zhou, Feilin Han, Yaofei Wu, and Wenwu Zhu. Videodreamer: Customized multi-subject text-to-video generation with disen-mix finetuning on language-video foundation models.IEEE Transactions on Multimedia, 2025
work page 2025
-
[2]
First frame is the place to go for video content customization.arXiv preprint arXiv:2511.15700, 2025
Jingxi Chen, Zongxia Li, Zhichao Liu, Guangyao Shi, Xiyang Wu, Fuxiao Liu, Cornelia Fermuller, Brandon Y Feng, and Yiannis Aloimonos. First frame is the place to go for video content customization.arXiv preprint arXiv:2511.15700, 2025
-
[3]
Humo: Human-centric video generation via collaborative multi-modal conditioning,
Liyang Chen, Tianxiang Ma, Jiawei Liu, Bingchuan Li, Zhuowei Chen, Lijie Liu, Xu He, Gen Li, Qian He, and Zhiyong Wu. Humo: Human-centric video generation via collaborative multi-modal conditioning.arXiv preprint arXiv:2509.08519, 2025
-
[4]
Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, Xiong Xiao, et al. Wavlm: Large-scale self-supervised pre- training for full stack speech processing.IEEE Journal of Selected Topics in Signal Processing, 16(6):1505–1518, 2022
work page 2022
-
[5]
F5-tts: A fairytaler that fakes fluent and faithful speech with flow matching
Yushen Chen, Zhikang Niu, Ziyang Ma, Keqi Deng, Chunhui Wang, JianZhao JianZhao, Kai Yu, and Xie Chen. F5-tts: A fairytaler that fakes fluent and faithful speech with flow matching. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6255–6271, 2025
work page 2025
-
[6]
Zhuowei Chen, Bingchuan Li, Tianxiang Ma, Lijie Liu, Mingcong Liu, Yi Zhang, Gen Li, Xinghui Li, Siyu Zhou, Qian He, et al. Phantom-data: Towards a general subject-consistent video generation dataset.arXiv preprint arXiv:2506.18851, 2025
-
[7]
Mmaudio: Taming multimodal joint training for high-quality video-to-audio synthesis
Ho Kei Cheng, Masato Ishii, Akio Hayakawa, Takashi Shibuya, Alexander Schwing, and Yuki Mitsufuji. Mmaudio: Taming multimodal joint training for high-quality video-to-audio synthesis. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 28901–28911, 2025
work page 2025
-
[8]
Unimax: Fairer and more effective language sampling for large-scale multilingual pretraining
Hyung Won Chung, Noah Constant, Xavier Garcia, Adam Roberts, Yi Tay, Sharan Narang, and Orhan Firat. Unimax: Fairer and more effective language sampling for large-scale multilingual pretraining.arXiv preprint arXiv:2304.09151, 2023
-
[9]
Out of time: automated lip sync in the wild
Joon Son Chung and Andrew Zisserman. Out of time: automated lip sync in the wild. InAsian conference on computer vision, pages 251–263. Springer, 2016
work page 2016
-
[10]
Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities.arXiv preprint arXiv:2507.06261, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[11]
Arcface: Additive angular margin loss for deep face recognition
Jiankang Deng, Jia Guo, Niannan Xue, and Stefanos Zafeiriou. Arcface: Additive angular margin loss for deep face recognition. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4690–4699, 2019
work page 2019
-
[12]
Retinaface: Single-shot multi-level face localisation in the wild
Jiankang Deng, Jia Guo, Evangelos Ververas, Irene Kotsia, and Stefanos Zafeiriou. Retinaface: Single-shot multi-level face localisation in the wild. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5203–5212, 2020
work page 2020
-
[13]
CosyVoice 3: Towards In-the-wild Speech Generation via Scaling-up and Post-training
Zhihao Du, Changfeng Gao, Yuxuan Wang, Fan Yu, Tianyu Zhao, Hao Wang, Xiang Lv, Hui Wang, Chongjia Ni, Xian Shi, et al. Cosyvoice 3: Towards in-the-wild speech generation via scaling-up and post-training.arXiv preprint arXiv:2505.17589, 2025. 10
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[14]
Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. The pascal visual object classes (voc) challenge.International journal of computer vision, 88 (2):303–338, 2010
work page 2010
-
[15]
Skyreels-a2: Compose anything in video diffusion transformers.arXiv preprint arXiv:2504.02436, 2025
Zhengcong Fei, Debang Li, Di Qiu, Jiahua Wang, Yikun Dou, Rui Wang, Jingtao Xu, Mingyuan Fan, Guibin Chen, Yang Li, et al. Skyreels-a2: Compose anything in video diffusion transform- ers.arXiv preprint arXiv:2504.02436, 2025
-
[16]
Wan-s2v: Audio-driven cinematic video generation.arXiv preprint arXiv:2508.18621, 2025
Xin Gao, Li Hu, Siqi Hu, Mingyang Huang, Chaonan Ji, Dechao Meng, Jinwei Qi, Penchong Qiao, Zhen Shen, Yafei Song, et al. Wan-s2v: Audio-driven cinematic video generation.arXiv preprint arXiv:2508.18621, 2025
-
[17]
Imagebind: One embedding space to bind them all
Rohit Girdhar, Alaaeldin El-Nouby, Zhuang Liu, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, and Ishan Misra. Imagebind: One embedding space to bind them all. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 15180–15190, 2023
work page 2023
-
[18]
Xu Guo, Fulong Ye, Qichao Sun, Liyang Chen, Bingchuan Li, Pengze Zhang, Jiawei Liu, Song- tao Zhao, Qian He, and Xiangwang Hou. Dreamid-omni: Unified framework for controllable human-centric audio-video generation.arXiv preprint arXiv:2602.12160, 2026
-
[19]
LTX-2: Efficient Joint Audio-Visual Foundation Model
Yoav HaCohen, Benny Brazowski, Nisan Chiprut, Yaki Bitterman, Andrew Kvochko, Avishai Berkowitz, Daniel Shalem, Daphna Lifschitz, Dudu Moshe, Eitan Porat, et al. Ltx-2: Efficient joint audio-visual foundation model.arXiv preprint arXiv:2601.03233, 2026
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[20]
Emilia: An extensive, multilingual, and diverse speech dataset for large-scale speech generation
Haorui He, Zengqiang Shang, Chaoren Wang, Xuyuan Li, Yicheng Gu, Hua Hua, Liwei Liu, Chen Yang, Jiaqi Li, Peiyang Shi, et al. Emilia: An extensive, multilingual, and diverse speech dataset for large-scale speech generation. In2024 IEEE Spoken Language Technology Workshop (SLT), pages 885–890. IEEE, 2024
work page 2024
-
[21]
Deep residual learning for image recognition
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016
work page 2016
-
[22]
Id- animator: Zero-shot identity-preserving human video gener- ation.arXiv:2404.15275, 2024
Xuanhua He, Quande Liu, Shengju Qian, Xin Wang, Tao Hu, Ke Cao, Keyu Yan, and Jie Zhang. Id-animator: Zero-shot identity-preserving human video generation.arXiv preprint arXiv:2404.15275, 2024
-
[23]
Lora: Low-rank adaptation of large language models.Iclr, 1(2):3, 2022
Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Liang Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models.Iclr, 1(2):3, 2022
work page 2022
-
[24]
Hangrui Hu, Xinfa Zhu, Ting He, Dake Guo, Bin Zhang, Xiong Wang, Zhifang Guo, Ziyue Jiang, Hongkun Hao, Zishan Guo, et al. Qwen3-tts technical report.arXiv preprint arXiv:2601.15621, 2026
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[25]
Teng Hu, Zhentao Yu, Guozhen Zhang, Zihan Su, Zhengguang Zhou, Youliang Zhang, Yuan Zhou, Qinglin Lu, and Ran Yi. Harmony: Harmonizing audio and video generation through cross-task synergy.arXiv preprint arXiv:2511.21579, 2025
-
[26]
Teng Hu, Zhentao Yu, Zhengguang Zhou, Sen Liang, Yuan Zhou, Qin Lin, and Qinglin Lu. Hunyuancustom: A multimodal-driven architecture for customized video generation.arXiv preprint arXiv:2505.04512, 2025
-
[27]
Videomage: Multi-subject and motion customization of text-to-video diffusion models
Chi-Pin Huang, Yen-Siang Wu, Hung-Kai Chung, Kai-Po Chang, Fu-En Yang, and Yu- Chiang Frank Wang. Videomage: Multi-subject and motion customization of text-to-video diffusion models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17603–17612, 2025
work page 2025
-
[28]
Yuzhou Huang, Ziyang Yuan, Quande Liu, Qiulin Wang, Xintao Wang, Ruimao Zhang, Pengfei Wan, Di Zhang, and Kun Gai. Conceptmaster: Multi-concept video customization on diffusion transformer models without test-time tuning.arXiv preprint arXiv:2501.04698, 2025. 11
-
[29]
Vbench: Comprehensive benchmark suite for video generative models
Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, et al. Vbench: Comprehensive benchmark suite for video generative models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21807–21818, 2024
work page 2024
-
[30]
Ye Jia, Yu Zhang, Ron Weiss, Quan Wang, Jonathan Shen, Fei Ren, Patrick Nguyen, Ruoming Pang, Ignacio Lopez Moreno, Yonghui Wu, et al. Transfer learning from speaker verification to multispeaker text-to-speech synthesis.Advances in neural information processing systems, 31, 2018
work page 2018
-
[31]
Vace: All-in- one video creation and editing
Zeyinzi Jiang, Zhen Han, Chaojie Mao, Jingfeng Zhang, Yulin Pan, and Yu Liu. Vace: All-in- one video creation and editing. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 17191–17202, 2025
work page 2025
-
[32]
Musiq: Multi-scale image quality transformer
Junjie Ke, Qifei Wang, Yilin Wang, Peyman Milanfar, and Feng Yang. Musiq: Multi-scale image quality transformer. InProceedings of the IEEE/CVF international conference on computer vision, pages 5148–5157, 2021
work page 2021
-
[33]
HunyuanVideo: A Systematic Framework For Large Video Generative Models
Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. Hunyuanvideo: A systematic framework for large video generative models.arXiv preprint arXiv:2412.03603, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[34]
Multi- concept customization of text-to-image diffusion
Nupur Kumari, Bingliang Zhang, Richard Zhang, Eli Shechtman, and Jun-Yan Zhu. Multi- concept customization of text-to-image diffusion. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1931–1941, 2023
work page 1931
-
[35]
Openhumanvid: A large-scale high-quality dataset for enhancing human-centric video generation
Hui Li, Mingwang Xu, Yun Zhan, Shan Mu, Jiaye Li, Kaihui Cheng, Yuxuan Chen, Tan Chen, Mao Ye, Jingdong Wang, et al. Openhumanvid: A large-scale high-quality dataset for enhancing human-centric video generation. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 7752–7762, 2025
work page 2025
-
[36]
Bindweave: Subject-consistent video generation via cross-modal integration,
Zhaoyang Li, Dongjun Qian, Kai Su, Qishuai Diao, Xiangyang Xia, Chang Liu, Wenfei Yang, Tianzhu Zhang, and Zehuan Yuan. Bindweave: Subject-consistent video generation via cross- modal integration.arXiv preprint arXiv:2510.00438, 2025
-
[37]
Movie weaver: Tuning-free multi-concept video personalization with anchored prompts
Feng Liang, Haoyu Ma, Zecheng He, Tingbo Hou, Ji Hou, Kunpeng Li, Xiaoliang Dai, Felix Juefei-Xu, Samaneh Azadi, Animesh Sinha, et al. Movie weaver: Tuning-free multi-concept video personalization with anchored prompts. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 13146–13156, 2025
work page 2025
-
[38]
Kai Liu, Wei Li, Lai Chen, Shengqiong Wu, Yanhao Zheng, Jiayi Ji, Fan Zhou, Jiebo Luo, Ziwei Liu, Hao Fei, et al. Javisdit: Joint audio-video diffusion transformer with hierarchical spatio-temporal prior synchronization.arXiv preprint arXiv:2503.23377, 2025
-
[39]
Phantom: Subject-consistent video generation via cross-modal alignment
Lijie Liu, Tianxiang Ma, Bingchuan Li, Zhuowei Chen, Jiawei Liu, Gen Li, Siyu Zhou, Qian He, and Xinglong Wu. Phantom: Subject-consistent video generation via cross-modal alignment. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 14951– 14961, 2025
work page 2025
-
[40]
Ovi: Twin Backbone Cross-Modal Fusion for Audio-Video Generation
Chetwin Low, Weimin Wang, and Calder Katyal. Ovi: Twin backbone cross-modal fusion for audio-video generation.arXiv preprint arXiv:2510.01284, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[41]
Simian Luo, Chuanhao Yan, Chenxu Hu, and Hang Zhao. Diff-foley: Synchronized video- to-audio synthesis with latent diffusion models.Advances in Neural Information Processing Systems, 36:48855–48876, 2023
work page 2023
-
[42]
Scalable diffusion models with transformers
William Peebles and Saining Xie. Scalable diffusion models with transformers. InProceedings of the IEEE/CVF international conference on computer vision, pages 4195–4205, 2023
work page 2023
-
[43]
Robust speech recognition via large-scale weak supervision
Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. Robust speech recognition via large-scale weak supervision. InInternational conference on machine learning, pages 28492–28518. PMLR, 2023. 12
work page 2023
-
[44]
High- resolution image synthesis with latent diffusion models
Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High- resolution image synthesis with latent diffusion models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022
work page 2022
-
[45]
U-net: Convolutional networks for biomedical image segmentation
Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. InInternational Conference on Medical image computing and computer-assisted intervention, pages 234–241. Springer, 2015
work page 2015
-
[46]
Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation
Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 22500–22510, 2023
work page 2023
-
[47]
Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. Laion- 5b: An open large-scale dataset for training next generation image-text models.Advances in neural information processing systems, 35:25278–25294, 2022
work page 2022
-
[48]
Seedance 2.0: Advancing Video Generation for World Complexity
Team Seedance, De Chen, Liyang Chen, Xin Chen, Ying Chen, Zhuo Chen, Zhuowei Chen, Feng Cheng, Tianheng Cheng, Yufeng Cheng, et al. Seedance 2.0: Advancing video generation for world complexity.arXiv preprint arXiv:2604.14148, 2026
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[49]
Sizhe Shan, Qiulin Li, Yutao Cui, Miles Yang, Yuehai Wang, Qun Yang, Jin Zhou, and Zhao Zhong. Hunyuanvideo-foley: Multimodal diffusion with representation alignment for high- fidelity foley audio generation.arXiv preprint arXiv:2508.16930, 2025
-
[50]
Decouple content and motion for conditional image-to-video generation
Cuifeng Shen, Yulu Gan, Chen Chen, Xiongwei Zhu, Lele Cheng, Tingting Gao, and Jinzhi Wang. Decouple content and motion for conditional image-to-video generation. InProceedings of the AAAI conference on artificial intelligence, volume 38, pages 4757–4765, 2024
work page 2024
-
[51]
Mova: Towards scalable and synchronized video-audio generation.arXiv preprint arXiv:2602.08794, 2026
OpenMOSS Team, Donghua Yu, Mingshu Chen, Qi Chen, Qi Luo, Qianyi Wu, Qinyuan Cheng, Ruixiao Li, Tianyi Liang, Wenbo Zhang, et al. Mova: Towards scalable and synchronized video-audio generation.arXiv preprint arXiv:2602.08794, 2026
-
[52]
Apoorv Vyas, Bowen Shi, Matthew Le, Andros Tjandra, Yi-Chiao Wu, Baishan Guo, Jiemin Zhang, Xinyue Zhang, Robert Adkins, William Ngan, et al. Audiobox: Unified audio generation with natural language prompts.arXiv preprint arXiv:2312.15821, 2023
-
[53]
Wan: Open and Advanced Large-Scale Video Generative Models
Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models.arXiv preprint arXiv:2503.20314, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[54]
UniVerse-1: Unified audio-video generation via stitching of experts,
Duomin Wang, Wei Zuo, Aojie Li, Ling-Hao Chen, Xinyao Liao, Deyu Zhou, Zixin Yin, Xili Dai, Daxin Jiang, and Gang Yu. Universe-1: Unified audio-video generation via stitching of experts.arXiv preprint arXiv:2509.06155, 2025
-
[55]
Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Sheng-ming Yin, Shuai Bai, Xiao Xu, Yilei Chen, et al. Qwen-image technical report.arXiv preprint arXiv:2508.02324, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[56]
Jin Xu, Zhifang Guo, Hangrui Hu, Yunfei Chu, Xiong Wang, Jinzheng He, Yuxuan Wang, Xian Shi, Ting He, Xinfa Zhu, et al. Qwen3-omni technical report.arXiv preprint arXiv:2509.17765, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[57]
Shenghai Yuan, Xianyi He, Yufan Deng, Yang Ye, Jinfa Huang, Bin Lin, Jiebo Luo, and Li Yuan. Opens2v-nexus: A detailed benchmark and million-scale dataset for subject-to-video generation. arXiv preprint arXiv:2505.20292, 2025
-
[58]
Identity-preserving text-to-video generation by frequency decomposition
Shenghai Yuan, Jinfa Huang, Xianyi He, Yunyang Ge, Yujun Shi, Liuhan Chen, Jiebo Luo, and Li Yuan. Identity-preserving text-to-video generation by frequency decomposition. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 12978–12988, 2025. 13
work page 2025
-
[59]
Yiming Zhang, Yicheng Gu, Yanhong Zeng, Zhening Xing, Yuancheng Wang, Zhizheng Wu, Bin Liu, and Kai Chen. Foleycrafter: Bring silent videos to life with lifelike and synchronized sounds.International Journal of Computer Vision, 134(1):46, 2026
work page 2026
-
[60]
Kaleido: Open-sourced multi-subject reference video generation model,
Zhenxing Zhang, Jiayan Teng, Zhuoyi Yang, Tiankun Cao, Cheng Wang, Xiaotao Gu, Jie Tang, Dan Guo, and Meng Wang. Kaleido: Open-sourced multi-subject reference video generation model.arXiv preprint arXiv:2510.18573, 2025
-
[61]
Ziqiang Zhang, Long Zhou, Chengyi Wang, Sanyuan Chen, Yu Wu, Shujie Liu, Zhuo Chen, Yanqing Liu, Huaming Wang, Jinyu Li, et al. Speak foreign languages with your own voice: Cross-lingual neural codec language modeling.arXiv preprint arXiv:2303.03926, 2023
-
[62]
Motiondirector: Motion customization of text-to-video diffusion models
Rui Zhao, Yuchao Gu, Jay Zhangjie Wu, David Junhao Zhang, Jia-Wei Liu, Weijia Wu, Jussi Keppo, and Mike Zheng Shou. Motiondirector: Motion customization of text-to-video diffusion models. InEuropean Conference on Computer Vision, pages 273–290. Springer, 2024
work page 2024
-
[63]
Open-Sora: Democratizing Efficient Video Production for All
Zangwei Zheng, Xiangyu Peng, Tianji Yang, Chenhui Shen, Shenggui Li, Hongxin Liu, Yukun Zhou, Tianyi Li, and Yang You. Open-sora: Democratizing efficient video production for all. arXiv preprint arXiv:2412.20404, 2024. 14
work page internal anchor Pith review Pith/arXiv arXiv 2024
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.