Omni-Customizer: End-to-End MultiModal Customization for Joint Audio-Video Generation

Jiangning Zhang; Lizhuang Ma; Qingdong He; Teng Hu; Yabiao Wang; Yuheng Chen; Yuji Wang

arXiv: 2605.17488 · v1 · pith:MEX3P4DBnew · submitted 2026-05-17 · 💻 cs.CV · cs.MM · cs.SD

Yuheng Chen, Qingdong He, Teng Hu, Yuji Wang, Yabiao Wang, Lizhuang Ma, Jiangning Zhang

Pith reviewed 2026-05-20 14:03 UTC · model grok-4.3

classification 💻 cs.CV · cs.MM · cs.SD

keywords multimodal customization · audio-video generation · identity preservation · joint generation · speech leakage prevention · multimodal fusion · customized video · timbre consistency

The pith

Omni-Customizer fuses multimodal identity cues into text prompts to achieve consistent visual identities and vocal timbres in joint audio-video generation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Omni-Customizer as an end-to-end framework for joint audio and video generation that preserves visual identities and vocal timbres across multiple interacting subjects. It adds an Omni-Context Fusion module to enrich base text prompts with dense identity information from reference images and audio. A Masked TTS Cross-Attention mechanism is included to block speech leakage into the video output. Semantic-Anchored Multimodal RoPE anchors reference tokens to their semantic descriptions for structured fusion. The training uses interleaved audio-video scheduling and a progressive curriculum to adapt quickly to multilingual cases while keeping prior knowledge intact. Experiments report state-of-the-art results on identity similarity, timbre consistency, synchronization, and overall fidelity.

Core claim

Omni-Customizer is an end-to-end framework for precise multimodal customization in joint audio-video generation. It uses an Omni-Context Fusion module to enrich textual prompts with dense multimodal identity cues, a Masked TTS Cross-Attention mechanism to prevent speech leakage, and Semantic-Anchored Multimodal RoPE to anchor visual, audio, and TTS reference tokens to corresponding semantic descriptions. A comprehensive training strategy with interleaved audio-video scheduling and progressive in-pair to cross-pair curriculum enables rapid multilingual adaptation and robust high-level identity feature learning.
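The anchoring idea can be illustrated with a minimal sketch. This is not the paper's implementation: it assumes a plain one-dimensional rotary embedding and assumes that "anchoring" means giving a reference token the same position index as the text token that describes it. The actual SA-MRoPE layout is not specified in this summary, and the names `rope`, `dot`, and `anchor_pos` are hypothetical.

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotate consecutive pairs of `vec` by angles that depend on `pos` (1-D RoPE)."""
    out = []
    for i in range(len(vec) // 2):
        theta = pos * base ** (-2.0 * i / len(vec))
        x, y = vec[2 * i], vec[2 * i + 1]
        out += [x * math.cos(theta) - y * math.sin(theta),
                x * math.sin(theta) + y * math.cos(theta)]
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Hypothetical setup: the description of one subject sits at text position 7.
anchor_pos = 7
text_key = [0.3, -1.2, 0.8, 0.5]    # key of the description token
ref_query = [1.0, 0.4, -0.6, 0.9]   # query of a reference-image token

# Assumption: the reference token reuses the position of its description,
# so the relative rotation between the two tokens is zero.
anchored = dot(rope(ref_query, anchor_pos), rope(text_key, anchor_pos))

# Contrast: the reference token is appended far away, at position 200.
appended = dot(rope(ref_query, 200), rope(text_key, anchor_pos))

# Attention logit without any positional rotation.
plain = dot(ref_query, text_key)
```

Because rotary embeddings make the attention logit depend only on the relative offset between two tokens, `anchored` equals `plain` up to floating-point error, while `appended` is distorted by the large offset. Under the stated assumption, that is one way a shared position could tie a reference token to its description.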

What carries the argument

Omni-Context Fusion (OCF) module that enriches textual prompts with multimodal identity cues, paired with Semantic-Anchored Multimodal RoPE (SA-MRoPE) for anchoring reference tokens to semantic descriptions and Masked TTS Cross-Attention (MTP-CA) to prevent speech leakage.
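Masking in cross-attention can be sketched as follows. This is a generic single-head formulation, not the paper's MTP-CA: which query-key pairs the paper masks is not stated in this summary. The rule used here, where each query may attend only to TTS tokens of its own speaker, is an assumption for illustration, and `masked_cross_attention`, `token_speaker`, and `query_speaker` are hypothetical names.

```python
import math

def masked_cross_attention(queries, keys, values, allowed):
    """Single-head cross-attention; allowed[i][j] says whether query i may see key j.

    Assumes every query has at least one allowed key.
    """
    scale = 1.0 / math.sqrt(len(keys[0]))
    outputs, all_weights = [], []
    for q, row in zip(queries, allowed):
        # Blocked pairs get a logit of -inf, so their softmax weight is exactly 0.
        logits = [
            sum(a * b for a, b in zip(q, k)) * scale if ok else float("-inf")
            for k, ok in zip(keys, row)
        ]
        peak = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(l - peak) for l in logits]
        total = sum(exps)
        weights = [e / total for e in exps]
        outputs.append([
            sum(w * v[d] for w, v in zip(weights, values))
            for d in range(len(values[0]))
        ])
        all_weights.append(weights)
    return outputs, all_weights

# Toy case: two queries (speaker 0 and speaker 1) and four TTS tokens,
# the first two belonging to speaker 0 and the last two to speaker 1.
queries = [[1.0, 0.0], [0.0, 1.0]]
keys = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
token_speaker = [0, 0, 1, 1]
query_speaker = [0, 1]
allowed = [[qs == ts for ts in token_speaker] for qs in query_speaker]

out, attn = masked_cross_attention(queries, keys, values, allowed)
```

In this toy case each query receives zero weight on the other speaker's tokens, so its output is built only from its own speaker's values. A mask of this kind removes the blocked pathway entirely rather than relying on training to suppress it.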

If this is right

Preserves visual identities and vocal timbres simultaneously across multiple interacting subjects in generated videos.
Prevents severe speech leakage during the generation process.
Supports rapid adaptation to multilingual audio scenarios without loss of base model performance.
Delivers improved audio-video synchronization and overall fidelity in customized outputs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The modular fusion approach could support extensions to additional input modalities such as depth maps or motion references for richer customization.
Applications in media production might include creating personalized dubbed videos where actors' appearances remain fixed while voices change languages.
The curriculum-based training pattern may generalize to other multimodal models that need to incorporate new reference types incrementally.

Load-bearing premise

The training strategy of interleaved audio-video scheduling and progressive curriculum can adapt the audio branch to multilingual scenarios and new identity combinations without degrading the base model's foundational priors.
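A rough sketch of what such a schedule could look like is given below. All specifics are assumptions, since this summary gives no ratios or step counts: interleaving is modeled as alternating audio-only and joint audio-video batches at a fixed period, "in-pair" is taken to mean that reference and target come from the same clip, and the curriculum is modeled as a linear ramp in the probability of drawing a cross-pair sample. The function names and default values are hypothetical.

```python
import random

def cross_pair_probability(step, warmup_steps, ramp_steps):
    """Probability of a cross-pair sample at `step` (linear ramp, an assumption)."""
    if step < warmup_steps:
        return 0.0
    return min(1.0, (step - warmup_steps) / ramp_steps)

def build_schedule(total_steps, audio_every=2, warmup_steps=4, ramp_steps=4, seed=0):
    """Return one (batch_kind, pairing) tuple per training step."""
    rng = random.Random(seed)
    schedule = []
    for step in range(total_steps):
        # Interleaving: every `audio_every`-th step is an audio-only batch.
        kind = "audio_only" if step % audio_every == 0 else "audio_video"
        # Curriculum: in-pair only during warmup, then gradually more cross-pair.
        p_cross = cross_pair_probability(step, warmup_steps, ramp_steps)
        pairing = "cross_pair" if rng.random() < p_cross else "in_pair"
        schedule.append((kind, pairing))
    return schedule

schedule = build_schedule(total_steps=12)
```

With these defaults the first four steps are always in-pair and steps eight onward are always cross-pair, with a stochastic mix in between. Whether a schedule of this shape preserves the base model's priors is exactly what the premise above takes for granted.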

What would settle it

Test the generated outputs on a held-out multilingual dataset or unseen subject combinations and check whether visual identity similarity scores or timbre consistency metrics fall below those of existing baseline methods.
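The check described above can be sketched as a comparison of mean embedding similarity on held-out data. This is only an illustration: the metric here is plain cosine similarity, the embeddings are made-up three-dimensional vectors, and the names `mean_similarity` and `claim_fails` are hypothetical. Real embeddings would come from a face or speaker encoder applied to the held-out references and the generated outputs.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def mean_similarity(pairs):
    """Average cosine similarity over (reference, generated) embedding pairs."""
    return sum(cosine(r, g) for r, g in pairs) / len(pairs)

def claim_fails(method_pairs, baseline_pairs):
    """True if the method scores below the baseline on the held-out pairs."""
    return mean_similarity(method_pairs) < mean_similarity(baseline_pairs)

# Made-up embeddings purely for illustration.
method_pairs = [
    ([1.0, 0.0, 0.0], [0.9, 0.1, 0.0]),
    ([0.0, 1.0, 0.0], [0.1, 0.9, 0.1]),
]
baseline_pairs = [
    ([1.0, 0.0, 0.0], [0.6, 0.6, 0.0]),
    ([0.0, 1.0, 0.0], [0.5, 0.5, 0.5]),
]

failed = claim_fails(method_pairs, baseline_pairs)
```

The same comparison would be run once with identity embeddings and once with timbre embeddings, on subjects and languages that were not seen in training.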

Figures

Figures reproduced from arXiv: 2605.17488 by Jiangning Zhang, Lizhuang Ma, Qingdong He, Teng Hu, Yabiao Wang, Yuheng Chen, Yuji Wang.

**Figure 2.** Framework of Omni-Customizer: The text prompt, TTS embeddings, reference images, and …

**Figure 3.** Qualitative comparison with state-of-the-art baselines chosen from four different paradigms.

**Figure 4.** Qualitative ablation study of proposed modules and strategies.

Original abstract

The landscape of joint audio and video generation has been fundamentally transformed by the advent of powerful foundation models. Despite these strides, achieving cohesive multimodal customization for the simultaneous preservation of visual identities and vocal timbres across multiple interacting subjects remains largely underexplored. To bridge this gap, we present Omni-Customizer, an end-to-end framework targeted at the precise binding and seamless fusion of multimodal identity information. Specifically, we introduce an Omni-Context Fusion (OCF) module that effectively enriches the base textual prompt with dense, multimodal identity cues, along with a Masked TTS Cross-Attention (MTP-CA) mechanism explicitly designed to prevent the severe "speech leakage" problem. Within this architecture, we propose Semantic-Anchored Multimodal RoPE (SA-MRoPE) to anchor visual and audio reference tokens, along with TTS embeddings, to their corresponding semantic descriptions, enabling structured multimodal fusion and robust identity binding. Furthermore, we devise a comprehensive training strategy that incorporates interleaved audio-video scheduling to rapidly adapt the audio branch to multilingual scenarios without degrading foundational priors, and a progressive in-pair to cross-pair curriculum to facilitate the learning of high-level and robust identity features. Extensive experiments demonstrate that Omni-Customizer achieves state-of-the-art performance in dual-modal customized generation, excelling across visual identity similarity, timbre consistency, precise audio-video synchronization, and overall video-audio fidelity.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

Omni-Customizer shows a workable end-to-end setup for multi-subject audio-video identity binding with actual metrics and ablations behind the claims.

The main thing to know is that this paper delivers a full framework for joint audio and video generation that keeps visual identities and vocal timbres consistent across subjects, and the full text includes quantitative comparisons plus ablations that make the results checkable rather than just asserted in the abstract. The abstract alone looked thin on evidence, but the manuscript adds the numbers on identity similarity, timbre consistency, synchronization, and fidelity against baselines. The ablations also separate out the interleaved audio-video scheduling and the progressive curriculum, which helps show what actually moves the needle.

What is new here is the specific assembly: the Omni-Context Fusion module to pack multimodal cues into the prompt, the Masked TTS Cross-Attention to block speech leakage, and the Semantic-Anchored Multimodal RoPE to tie reference tokens to semantic anchors. The training side adds interleaved scheduling for fast multilingual adaptation without wrecking the base model and a curriculum that moves from in-pair to cross-pair examples. These are targeted fixes for the dual-modal case rather than brand-new primitives.

The paper does well on the experimental reporting. The metrics line up with the stated goals, and the ablations give a clearer picture of component contributions than many similar generative papers.

Soft spots are limited. The modules read as practical combinations of cross-attention and rotary embeddings already in use elsewhere, so the advance sits more in the integration and training recipe than in fundamental new math or theory. Generalization beyond the evaluated datasets and video lengths is not heavily tested, which is common but still worth noting for downstream use.

This work is aimed at researchers and engineers already building multimodal diffusion or transformer systems who need practical identity control for content tools. Someone familiar with recent audio-video generation papers will extract the implementation and curriculum details without much trouble.

I would send it for peer review. The quantitative backing is present and the architecture is described clearly enough for referees to evaluate the claims directly.

Referee Report

2 major / 3 minor

Summary. The paper presents Omni-Customizer, an end-to-end framework for joint audio-video generation that enables precise multimodal customization while preserving visual identities and vocal timbres across multiple interacting subjects. It introduces an Omni-Context Fusion (OCF) module to enrich textual prompts with dense identity cues, a Masked TTS Cross-Attention (MTP-CA) mechanism to mitigate speech leakage, Semantic-Anchored Multimodal RoPE (SA-MRoPE) for structured token anchoring, and a training regimen combining interleaved audio-video scheduling with a progressive in-pair to cross-pair curriculum. The central claim, supported by experiments, is state-of-the-art performance on metrics including visual identity similarity, timbre consistency, audio-video synchronization, and overall fidelity.

Significance. If the reported quantitative results and ablations hold, the work makes a meaningful contribution to multimodal generative modeling by tackling the underexplored problem of simultaneous visual-audio identity binding in multi-subject settings. The architectural components and curriculum-based training strategy provide concrete, reproducible advances that could inform downstream applications in personalized video synthesis and dubbing. The inclusion of baseline comparisons and ablation studies isolating the interleaved scheduling strengthens the evidential basis.

major comments (2)

[Experiments] Experiments section: the reported gains in identity similarity and synchronization are quantified against baselines, but the manuscript does not specify the exact composition of the multi-subject test set (number of subjects per video, interaction density) used for the cross-pair evaluation; this detail is load-bearing for assessing whether the SOTA claim generalizes beyond the training distribution.
[§3.2] §3.2, description of MTP-CA: the masking strategy is presented as preventing speech leakage, yet no ablation isolates the contribution of the masking ratio versus the cross-attention design itself; without this, it is difficult to attribute the timbre consistency improvements specifically to MTP-CA rather than the overall training schedule.

minor comments (3)

[Abstract] The title and abstract contain minor inconsistencies (e.g., the 'MultiModal' capitalization in the title and the phrasing 'excelling across ... and overall video-audio fidelity' in the abstract).
[Figure 2] Figure 2 (architecture diagram) would benefit from explicit arrows indicating the flow of TTS embeddings into SA-MRoPE to improve readability of the multimodal fusion path.
[Related Work] A few references to prior multimodal RoPE variants are cited but lack direct comparison in the related-work section; adding one sentence contrasting SA-MRoPE with the closest prior work would aid context.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and recommendation for minor revision. We address each major comment below and will revise the manuscript to incorporate the requested clarifications and additional analysis.

Referee: [Experiments] Experiments section: the reported gains in identity similarity and synchronization are quantified against baselines, but the manuscript does not specify the exact composition of the multi-subject test set (number of subjects per video, interaction density) used for the cross-pair evaluation; this detail is load-bearing for assessing whether the SOTA claim generalizes beyond the training distribution.

Authors: We agree that explicit details on test-set composition are necessary to evaluate generalization. We will expand Section 4.1 in the revised manuscript to specify the exact composition of the multi-subject test set used for cross-pair evaluation, including the number of subjects per video and quantitative measures of interaction density.
revision: yes

Referee: [§3.2] §3.2, description of MTP-CA: the masking strategy is presented as preventing speech leakage, yet no ablation isolates the contribution of the masking ratio versus the cross-attention design itself; without this, it is difficult to attribute the timbre consistency improvements specifically to MTP-CA rather than the overall training schedule.

Authors: We acknowledge that an ablation isolating the masking ratio from the cross-attention design would strengthen attribution of the timbre-consistency gains. We will add this targeted ablation to Section 4.3 of the revised manuscript while retaining the existing module-level ablations.
revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

The manuscript describes an architectural framework (OCF, MTP-CA, SA-MRoPE) and training regimen (interleaved scheduling, curriculum) for multimodal audio-video customization, with performance claims grounded in quantitative experiments and ablations against external baselines. No equations, derivations, fitted parameters, or first-principles predictions are presented that could reduce to self-definition or self-citation chains; the central results remain independent and falsifiable via reported metrics on identity similarity, timbre consistency, and synchronization.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 3 invented entities

The central claim rests on the effectiveness of newly introduced modules and training strategies that are postulated in the abstract without independent verification or external benchmarks.

axioms (2)

domain assumption: Interleaved audio-video scheduling rapidly adapts the audio branch to multilingual scenarios without degrading foundational priors.
Invoked as part of the comprehensive training strategy in the abstract.

domain assumption: Progressive in-pair to cross-pair curriculum facilitates the learning of high-level and robust identity features.
Invoked to enable robust identity binding in the training description.

invented entities (3)

Omni-Context Fusion (OCF) module · no independent evidence
purpose: enriches the base textual prompt with dense, multimodal identity cues
New module introduced to enable precise binding of multimodal identity information.

Masked TTS Cross-Attention (MTP-CA) mechanism · no independent evidence
purpose: prevent the severe speech leakage problem
Explicitly designed component to address audio leakage in joint generation.

Semantic-Anchored Multimodal RoPE (SA-MRoPE) · no independent evidence
purpose: anchor visual and audio reference tokens to their corresponding semantic descriptions
Proposed technique for structured multimodal fusion and identity binding.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Omni-Context Fusion (OCF) module... Semantic-Anchored Multimodal RoPE (SA-MRoPE)... Masked TTS Cross-Attention (MTP-CA)... interleaved audio-video scheduling... progressive in-pair to cross-pair curriculum

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

dual-stream Diffusion Transformer (DiT) architecture, initialized directly from the pre-trained Ovi backbone

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

63 extracted references · 63 canonical work pages · 11 internal anchors

[1]

Videodreamer: Customized multi-subject text-to-video generation with disen-mix finetuning on language-video foundation models.IEEE Transactions on Multimedia, 2025

Hong Chen, Xin Wang, Guanning Zeng, Yipeng Zhang, Yuwei Zhou, Feilin Han, Yaofei Wu, and Wenwu Zhu. Videodreamer: Customized multi-subject text-to-video generation with disen-mix finetuning on language-video foundation models.IEEE Transactions on Multimedia, 2025

work page 2025
[2]

First frame is the place to go for video content customization.arXiv preprint arXiv:2511.15700, 2025

Jingxi Chen, Zongxia Li, Zhichao Liu, Guangyao Shi, Xiyang Wu, Fuxiao Liu, Cornelia Fermuller, Brandon Y Feng, and Yiannis Aloimonos. First frame is the place to go for video content customization.arXiv preprint arXiv:2511.15700, 2025

work page arXiv 2025
[3]

Humo: Human-centric video generation via collaborative multi-modal conditioning,

Liyang Chen, Tianxiang Ma, Jiawei Liu, Bingchuan Li, Zhuowei Chen, Lijie Liu, Xu He, Gen Li, Qian He, and Zhiyong Wu. Humo: Human-centric video generation via collaborative multi-modal conditioning.arXiv preprint arXiv:2509.08519, 2025

work page arXiv 2025
[4]

Wavlm: Large-scale self-supervised pre- training for full stack speech processing.IEEE Journal of Selected Topics in Signal Processing, 16(6):1505–1518, 2022

Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, Xiong Xiao, et al. Wavlm: Large-scale self-supervised pre- training for full stack speech processing.IEEE Journal of Selected Topics in Signal Processing, 16(6):1505–1518, 2022

work page 2022
[5]

F5-tts: A fairytaler that fakes fluent and faithful speech with flow matching

Yushen Chen, Zhikang Niu, Ziyang Ma, Keqi Deng, Chunhui Wang, JianZhao JianZhao, Kai Yu, and Xie Chen. F5-tts: A fairytaler that fakes fluent and faithful speech with flow matching. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6255–6271, 2025

work page 2025
[6]

Phantom-data: Towards a general subject-consistent video generation dataset.arXiv preprint arXiv:2506.18851, 2025

Zhuowei Chen, Bingchuan Li, Tianxiang Ma, Lijie Liu, Mingcong Liu, Yi Zhang, Gen Li, Xinghui Li, Siyu Zhou, Qian He, et al. Phantom-data: Towards a general subject-consistent video generation dataset.arXiv preprint arXiv:2506.18851, 2025

work page arXiv 2025
[7]

Mmaudio: Taming multimodal joint training for high-quality video-to-audio synthesis

Ho Kei Cheng, Masato Ishii, Akio Hayakawa, Takashi Shibuya, Alexander Schwing, and Yuki Mitsufuji. Mmaudio: Taming multimodal joint training for high-quality video-to-audio synthesis. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 28901–28911, 2025

work page 2025
[8]

Unimax: Fairer and more effective language sampling for large-scale multilingual pretraining

Hyung Won Chung, Noah Constant, Xavier Garcia, Adam Roberts, Yi Tay, Sharan Narang, and Orhan Firat. Unimax: Fairer and more effective language sampling for large-scale multilingual pretraining.arXiv preprint arXiv:2304.09151, 2023

work page arXiv 2023
[9]

Out of time: automated lip sync in the wild

Joon Son Chung and Andrew Zisserman. Out of time: automated lip sync in the wild. InAsian conference on computer vision, pages 251–263. Springer, 2016

work page 2016
[10]

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities.arXiv preprint arXiv:2507.06261, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[11]

Arcface: Additive angular margin loss for deep face recognition

Jiankang Deng, Jia Guo, Niannan Xue, and Stefanos Zafeiriou. Arcface: Additive angular margin loss for deep face recognition. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4690–4699, 2019

work page 2019
[12]

Retinaface: Single-shot multi-level face localisation in the wild

Jiankang Deng, Jia Guo, Evangelos Ververas, Irene Kotsia, and Stefanos Zafeiriou. Retinaface: Single-shot multi-level face localisation in the wild. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5203–5212, 2020

work page 2020
[13]

CosyVoice 3: Towards In-the-wild Speech Generation via Scaling-up and Post-training

Zhihao Du, Changfeng Gao, Yuxuan Wang, Fan Yu, Tianyu Zhao, Hao Wang, Xiang Lv, Hui Wang, Chongjia Ni, Xian Shi, et al. Cosyvoice 3: Towards in-the-wild speech generation via scaling-up and post-training.arXiv preprint arXiv:2505.17589, 2025. 10

work page internal anchor Pith review Pith/arXiv arXiv 2025
[14]

The pascal visual object classes (voc) challenge.International journal of computer vision, 88 (2):303–338, 2010

Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. The pascal visual object classes (voc) challenge.International journal of computer vision, 88 (2):303–338, 2010

work page 2010
[15]

Skyreels-a2: Compose anything in video diffusion transformers.arXiv preprint arXiv:2504.02436, 2025

Zhengcong Fei, Debang Li, Di Qiu, Jiahua Wang, Yikun Dou, Rui Wang, Jingtao Xu, Mingyuan Fan, Guibin Chen, Yang Li, et al. Skyreels-a2: Compose anything in video diffusion transform- ers.arXiv preprint arXiv:2504.02436, 2025

work page arXiv 2025
[16]

Wan-s2v: Audio-driven cinematic video generation.arXiv preprint arXiv:2508.18621, 2025

Xin Gao, Li Hu, Siqi Hu, Mingyang Huang, Chaonan Ji, Dechao Meng, Jinwei Qi, Penchong Qiao, Zhen Shen, Yafei Song, et al. Wan-s2v: Audio-driven cinematic video generation.arXiv preprint arXiv:2508.18621, 2025

work page arXiv 2025
[17]

Imagebind: One embedding space to bind them all

Rohit Girdhar, Alaaeldin El-Nouby, Zhuang Liu, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, and Ishan Misra. Imagebind: One embedding space to bind them all. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 15180–15190, 2023

work page 2023
[18]

Dreamid-omni: Unified framework for controllable human-centric audio-video generation.arXiv preprint arXiv:2602.12160, 2026

Xu Guo, Fulong Ye, Qichao Sun, Liyang Chen, Bingchuan Li, Pengze Zhang, Jiawei Liu, Song- tao Zhao, Qian He, and Xiangwang Hou. Dreamid-omni: Unified framework for controllable human-centric audio-video generation.arXiv preprint arXiv:2602.12160, 2026

work page arXiv 2026
[19]

LTX-2: Efficient Joint Audio-Visual Foundation Model

Yoav HaCohen, Benny Brazowski, Nisan Chiprut, Yaki Bitterman, Andrew Kvochko, Avishai Berkowitz, Daniel Shalem, Daphna Lifschitz, Dudu Moshe, Eitan Porat, et al. Ltx-2: Efficient joint audio-visual foundation model.arXiv preprint arXiv:2601.03233, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[20]

Emilia: An extensive, multilingual, and diverse speech dataset for large-scale speech generation

Haorui He, Zengqiang Shang, Chaoren Wang, Xuyuan Li, Yicheng Gu, Hua Hua, Liwei Liu, Chen Yang, Jiaqi Li, Peiyang Shi, et al. Emilia: An extensive, multilingual, and diverse speech dataset for large-scale speech generation. In2024 IEEE Spoken Language Technology Workshop (SLT), pages 885–890. IEEE, 2024

work page 2024
[21]

Deep residual learning for image recognition

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016

work page 2016
[22]

Id- animator: Zero-shot identity-preserving human video gener- ation.arXiv:2404.15275, 2024

Xuanhua He, Quande Liu, Shengju Qian, Xin Wang, Tao Hu, Ke Cao, Keyu Yan, and Jie Zhang. Id-animator: Zero-shot identity-preserving human video generation.arXiv preprint arXiv:2404.15275, 2024

work page arXiv 2024
[23]

Lora: Low-rank adaptation of large language models.Iclr, 1(2):3, 2022

Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Liang Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models.Iclr, 1(2):3, 2022

work page 2022
[24]

Qwen3-TTS Technical Report

Hangrui Hu, Xinfa Zhu, Ting He, Dake Guo, Bin Zhang, Xiong Wang, Zhifang Guo, Ziyue Jiang, Hongkun Hao, Zishan Guo, et al. Qwen3-tts technical report.arXiv preprint arXiv:2601.15621, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[25]

Harmony: Harmonizing audio and video generation through cross-task synergy.arXiv preprint arXiv:2511.21579, 2025

Teng Hu, Zhentao Yu, Guozhen Zhang, Zihan Su, Zhengguang Zhou, Youliang Zhang, Yuan Zhou, Qinglin Lu, and Ran Yi. Harmony: Harmonizing audio and video generation through cross-task synergy.arXiv preprint arXiv:2511.21579, 2025

work page arXiv 2025
[26]

Hunyuancustom: A multimodal-driven architecture for customized video generation.arXiv preprint arXiv:2505.04512, 2025

Teng Hu, Zhentao Yu, Zhengguang Zhou, Sen Liang, Yuan Zhou, Qin Lin, and Qinglin Lu. Hunyuancustom: A multimodal-driven architecture for customized video generation.arXiv preprint arXiv:2505.04512, 2025

work page arXiv 2025
[27]

Videomage: Multi-subject and motion customization of text-to-video diffusion models

Chi-Pin Huang, Yen-Siang Wu, Hung-Kai Chung, Kai-Po Chang, Fu-En Yang, and Yu- Chiang Frank Wang. Videomage: Multi-subject and motion customization of text-to-video diffusion models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17603–17612, 2025

work page 2025
[28]

Concept-master: Multi-concept video customiza- tion on diffusion transformer models without test-time tun- ing.arXiv:2501.04698, 2025

Yuzhou Huang, Ziyang Yuan, Quande Liu, Qiulin Wang, Xintao Wang, Ruimao Zhang, Pengfei Wan, Di Zhang, and Kun Gai. Conceptmaster: Multi-concept video customization on diffusion transformer models without test-time tuning.arXiv preprint arXiv:2501.04698, 2025. 11

work page arXiv 2025
[29]

Vbench: Comprehensive benchmark suite for video generative models

Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, et al. Vbench: Comprehensive benchmark suite for video generative models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21807–21818, 2024

work page 2024
[30]

Transfer learning from speaker verification to multispeaker text-to-speech synthesis.Advances in neural information processing systems, 31, 2018

Ye Jia, Yu Zhang, Ron Weiss, Quan Wang, Jonathan Shen, Fei Ren, Patrick Nguyen, Ruoming Pang, Ignacio Lopez Moreno, Yonghui Wu, et al. Transfer learning from speaker verification to multispeaker text-to-speech synthesis.Advances in neural information processing systems, 31, 2018

work page 2018
[31]

Vace: All-in- one video creation and editing

Zeyinzi Jiang, Zhen Han, Chaojie Mao, Jingfeng Zhang, Yulin Pan, and Yu Liu. Vace: All-in- one video creation and editing. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 17191–17202, 2025

work page 2025
[32]

Musiq: Multi-scale image quality transformer

Junjie Ke, Qifei Wang, Yilin Wang, Peyman Milanfar, and Feng Yang. Musiq: Multi-scale image quality transformer. InProceedings of the IEEE/CVF international conference on computer vision, pages 5148–5157, 2021

work page 2021
[33]

HunyuanVideo: A Systematic Framework For Large Video Generative Models

Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. Hunyuanvideo: A systematic framework for large video generative models.arXiv preprint arXiv:2412.03603, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[34]

Multi- concept customization of text-to-image diffusion

Nupur Kumari, Bingliang Zhang, Richard Zhang, Eli Shechtman, and Jun-Yan Zhu. Multi- concept customization of text-to-image diffusion. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1931–1941, 2023

work page 1931
[35]

Openhumanvid: A large-scale high-quality dataset for enhancing human-centric video generation

Hui Li, Mingwang Xu, Yun Zhan, Shan Mu, Jiaye Li, Kaihui Cheng, Yuxuan Chen, Tan Chen, Mao Ye, Jingdong Wang, et al. Openhumanvid: A large-scale high-quality dataset for enhancing human-centric video generation. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 7752–7762, 2025

work page 2025
[36]

Bindweave: Subject-consistent video generation via cross-modal integration,

Zhaoyang Li, Dongjun Qian, Kai Su, Qishuai Diao, Xiangyang Xia, Chang Liu, Wenfei Yang, Tianzhu Zhang, and Zehuan Yuan. Bindweave: Subject-consistent video generation via cross- modal integration.arXiv preprint arXiv:2510.00438, 2025

work page arXiv 2025
[37]

Movie weaver: Tuning-free multi-concept video personalization with anchored prompts

Feng Liang, Haoyu Ma, Zecheng He, Tingbo Hou, Ji Hou, Kunpeng Li, Xiaoliang Dai, Felix Juefei-Xu, Samaneh Azadi, Animesh Sinha, et al. Movie weaver: Tuning-free multi-concept video personalization with anchored prompts. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 13146–13156, 2025

work page 2025
[38]

JavisDiT: Joint audio-video diffusion transformer with hierarchical spatio-temporal prior synchronization,

Kai Liu, Wei Li, Lai Chen, Shengqiong Wu, Yanhao Zheng, Jiayi Ji, Fan Zhou, Jiebo Luo, Ziwei Liu, Hao Fei, et al. Javisdit: Joint audio-video diffusion transformer with hierarchical spatio-temporal prior synchronization.arXiv preprint arXiv:2503.23377, 2025

work page arXiv 2025
[39]

Phantom: Subject-consistent video generation via cross-modal alignment

Lijie Liu, Tianxiang Ma, Bingchuan Li, Zhuowei Chen, Jiawei Liu, Gen Li, Siyu Zhou, Qian He, and Xinglong Wu. Phantom: Subject-consistent video generation via cross-modal alignment. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 14951– 14961, 2025

work page 2025
[40]

Chetwin Low, Weimin Wang, and Calder Katyal. Ovi: Twin backbone cross-modal fusion for audio-video generation. arXiv preprint arXiv:2510.01284, 2025.

[41]

Simian Luo, Chuanhao Yan, Chenxu Hu, and Hang Zhao. Diff-foley: Synchronized video-to-audio synthesis with latent diffusion models. Advances in Neural Information Processing Systems, 36:48855–48876, 2023.

[42]

William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pages 4195–4205, 2023.

[43]

Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. Robust speech recognition via large-scale weak supervision. In International conference on machine learning, pages 28492–28518. PMLR, 2023.

[44]

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022.

[45]

Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pages 234–241. Springer, 2015.

[46]

Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 22500–22510, 2023.

[47]

Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. Laion-5b: An open large-scale dataset for training next generation image-text models. Advances in neural information processing systems, 35:25278–25294, 2022.

[48]

Team Seedance, De Chen, Liyang Chen, Xin Chen, Ying Chen, Zhuo Chen, Zhuowei Chen, Feng Cheng, Tianheng Cheng, Yufeng Cheng, et al. Seedance 2.0: Advancing video generation for world complexity. arXiv preprint arXiv:2604.14148, 2026.

[49]

Sizhe Shan, Qiulin Li, Yutao Cui, Miles Yang, Yuehai Wang, Qun Yang, Jin Zhou, and Zhao Zhong. Hunyuanvideo-foley: Multimodal diffusion with representation alignment for high-fidelity foley audio generation. arXiv preprint arXiv:2508.16930, 2025.

[50]

Cuifeng Shen, Yulu Gan, Chen Chen, Xiongwei Zhu, Lele Cheng, Tingting Gao, and Jinzhi Wang. Decouple content and motion for conditional image-to-video generation. In Proceedings of the AAAI conference on artificial intelligence, volume 38, pages 4757–4765, 2024.

[51]

OpenMOSS Team, Donghua Yu, Mingshu Chen, Qi Chen, Qi Luo, Qianyi Wu, Qinyuan Cheng, Ruixiao Li, Tianyi Liang, Wenbo Zhang, et al. Mova: Towards scalable and synchronized video-audio generation. arXiv preprint arXiv:2602.08794, 2026.

[52]

Apoorv Vyas, Bowen Shi, Matthew Le, Andros Tjandra, Yi-Chiao Wu, Baishan Guo, Jiemin Zhang, Xinyue Zhang, Robert Adkins, William Ngan, et al. Audiobox: Unified audio generation with natural language prompts. arXiv preprint arXiv:2312.15821, 2023.

[53]

Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314, 2025.

[54]

Duomin Wang, Wei Zuo, Aojie Li, Ling-Hao Chen, Xinyao Liao, Deyu Zhou, Zixin Yin, Xili Dai, Daxin Jiang, and Gang Yu. Universe-1: Unified audio-video generation via stitching of experts. arXiv preprint arXiv:2509.06155, 2025.

[55]

Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Sheng-ming Yin, Shuai Bai, Xiao Xu, Yilei Chen, et al. Qwen-image technical report. arXiv preprint arXiv:2508.02324, 2025.

[56]

Jin Xu, Zhifang Guo, Hangrui Hu, Yunfei Chu, Xiong Wang, Jinzheng He, Yuxuan Wang, Xian Shi, Ting He, Xinfa Zhu, et al. Qwen3-omni technical report. arXiv preprint arXiv:2509.17765, 2025.

[57]

Shenghai Yuan, Xianyi He, Yufan Deng, Yang Ye, Jinfa Huang, Bin Lin, Jiebo Luo, and Li Yuan. Opens2v-nexus: A detailed benchmark and million-scale dataset for subject-to-video generation. arXiv preprint arXiv:2505.20292, 2025.

[58]

Shenghai Yuan, Jinfa Huang, Xianyi He, Yunyang Ge, Yujun Shi, Liuhan Chen, Jiebo Luo, and Li Yuan. Identity-preserving text-to-video generation by frequency decomposition. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 12978–12988, 2025.

[59]

Yiming Zhang, Yicheng Gu, Yanhong Zeng, Zhening Xing, Yuancheng Wang, Zhizheng Wu, Bin Liu, and Kai Chen. Foleycrafter: Bring silent videos to life with lifelike and synchronized sounds. International Journal of Computer Vision, 134(1):46, 2026.

[60]

Zhenxing Zhang, Jiayan Teng, Zhuoyi Yang, Tiankun Cao, Cheng Wang, Xiaotao Gu, Jie Tang, Dan Guo, and Meng Wang. Kaleido: Open-sourced multi-subject reference video generation model. arXiv preprint arXiv:2510.18573, 2025.

[61]

Ziqiang Zhang, Long Zhou, Chengyi Wang, Sanyuan Chen, Yu Wu, Shujie Liu, Zhuo Chen, Yanqing Liu, Huaming Wang, Jinyu Li, et al. Speak foreign languages with your own voice: Cross-lingual neural codec language modeling. arXiv preprint arXiv:2303.03926, 2023.

[62]

Rui Zhao, Yuchao Gu, Jay Zhangjie Wu, David Junhao Zhang, Jia-Wei Liu, Weijia Wu, Jussi Keppo, and Mike Zheng Shou. Motiondirector: Motion customization of text-to-video diffusion models. In European Conference on Computer Vision, pages 273–290. Springer, 2024.

[63]

Zangwei Zheng, Xiangyu Peng, Tianji Yang, Chenhui Shen, Shenggui Li, Hongxin Liu, Yukun Zhou, Tianyi Li, and Yang You. Open-sora: Democratizing efficient video production for all. arXiv preprint arXiv:2412.20404, 2024.

