JUST-DUB-IT: Video Dubbing via Joint Audio-Visual Diffusion
Pith reviewed 2026-05-16 09:52 UTC · model grok-4.3
The pith
Adapting a pre-trained audio-visual diffusion model with lightweight LoRA enables joint generation of translated audio and synchronized facial motion for video dubbing.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that conditioning an audio-visual foundation model on an input video via a LoRA, and training that LoRA on synthetic paired data created by generating language-switched clips and then inpainting each half to match the opposite language, produces dubbed videos with higher visual fidelity, more accurate lip synchronization, and greater robustness to complex motion than existing dubbing pipelines.
What carries the argument
The lightweight LoRA that adapts a joint audio-visual diffusion model to condition on an input audio-video clip and generate translated audio together with matching facial animations.
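As a concrete illustration of this mechanism, the sketch below shows how a low-rank adapter can wrap a frozen attention projection of a diffusion transformer. The dimensions, rank, and module shapes are assumptions for the example, not the paper's reported configuration.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wraps a frozen linear layer with a trainable low-rank update:
    y = W x + (alpha / r) * B A x, where A and B are the only trained weights."""

    def __init__(self, base: nn.Linear, r: int = 16, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # the foundation model stays frozen
        self.lora_a = nn.Linear(base.in_features, r, bias=False)
        self.lora_b = nn.Linear(r, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)  # adapter starts as a zero update
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))

# Illustrative use: wrap one attention projection of a hypothetical DiT block.
attn_qkv = nn.Linear(1024, 3 * 1024)   # stand-in for a frozen attention projection
adapted = LoRALinear(attn_qkv, r=16)
tokens = torch.randn(2, 257, 1024)     # (batch, audio+video tokens, dim) -- assumed layout
out = adapted(tokens)                  # same shape as the frozen layer's output
print(out.shape)                       # torch.Size([2, 257, 3072])
```

Because the low-rank branch is initialized to zero, the adapted model starts out identical to the frozen foundation model and only gradually learns the dubbing-specific conditioning.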
If this is right
- Dubbed videos maintain consistent speaker identity when audio is translated to a new language.
- Lip movements align more closely with the new audio track without separate alignment modules.
- The single model handles complex body motion and real-world lighting better than pipeline approaches.
- No large external collection of paired multilingual video data is required for training.
- Visual quality and robustness improve over methods that treat audio translation and face animation as separate stages.
Where Pith is reading between the lines
- The self-synthesis strategy for creating training pairs could extend to other cross-modal video editing tasks such as emotion transfer.
- If inference speed remains practical, the method could support on-device or live dubbing applications.
- The success of the approach implies that strong generative priors in foundation models can substitute for hand-crafted dubbing architectures in many settings.
- Broader testing across diverse accents and speaking rates would clarify how far the robustness extends beyond the training distribution.
Load-bearing premise
The base generative model can synthesize paired multilingual videos of the same speaker via language switches and inpainting without introducing artifacts that degrade the LoRA adaptation.
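Read as a procedure, this premise corresponds to a data-generation loop like the following sketch. `generate_switch_clip` and `inpaint_face_and_audio` are hypothetical placeholders for calls into the frozen foundation model, not the authors' actual API.

```python
# Sketch of the self-synthesis recipe described in the abstract.
# The model methods below are hypothetical stand-ins, not the paper's interface.

def make_training_pairs(prompt, lang_a, lang_b, model):
    # 1. Generate one clip in which the speaker switches language mid-clip.
    clip = model.generate_switch_clip(prompt, first_lang=lang_a, second_lang=lang_b)
    first_half, second_half = clip.split_at_midpoint()

    # 2. Inpaint each half's face region and audio track to match the *other*
    #    language, keeping identity, pose, lighting, and background fixed.
    first_in_b = model.inpaint_face_and_audio(first_half, target_lang=lang_b)
    second_in_a = model.inpaint_face_and_audio(second_half, target_lang=lang_a)

    # 3. Each half now exists in both languages with everything else shared,
    #    yielding two (source video, dubbed target) pairs for LoRA training.
    return [(first_half, first_in_b), (second_in_a, second_half)]
```

The load on the premise is visible in step 2: any inpainting artifact becomes part of the supervision signal that the LoRA is trained to reproduce.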
What would settle it
A side-by-side evaluation on real-world test videos where the method's outputs show lower lip-synchronization accuracy or more visible visual artifacts than outputs from current multi-stage dubbing systems would falsify the improvement claim.
Original abstract
Audio-Visual Foundation Models, which are pretrained to jointly generate sound and visual content, have recently shown an unprecedented ability to model multi-modal generation and editing, opening new opportunities for downstream tasks. Among these tasks, video dubbing could greatly benefit from such priors, yet most existing solutions still rely on complex, task-specific pipelines that struggle in real-world settings. In this work, we introduce a single-model approach that adapts a foundational audio-video diffusion model for video-to-video dubbing via a lightweight LoRA. The LoRA enables the model to condition on an input audio-video while jointly generating translated audio and synchronized facial motion. To train this LoRA, we leverage the generative model itself to synthesize paired multilingual videos of the same speaker. Specifically, we generate multilingual videos with language switches within a single clip, and then inpaint the face and audio in each half to match the language of the other half. By leveraging the rich generative prior of the audio-visual model, our approach preserves speaker identity and lip synchronization while remaining robust to complex motion and real-world dynamics. We demonstrate that our approach produces high-quality dubbed videos with improved visual fidelity, lip synchronization, and robustness compared to existing dubbing pipelines.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces JUST-DUB-IT, a single-model approach that adapts a pre-trained audio-visual diffusion foundation model via lightweight LoRA for video-to-video dubbing. Training pairs are synthesized by the base model itself: multilingual clips are generated with intra-clip language switches, after which face and audio are inpainted in each half to match the opposite language. The central claim is that the resulting LoRA produces dubbed videos with improved visual fidelity, lip synchronization, and robustness relative to existing multi-stage dubbing pipelines.
Significance. If the synthetic training pairs prove free of systematic artifacts, the method would offer a streamlined alternative to task-specific dubbing pipelines by directly exploiting generative priors from audio-visual foundation models. The self-supervised data-generation strategy is a notable technical contribution that could extend to other cross-lingual audio-visual tasks, provided it supplies unbiased supervision for lip motion and identity preservation.
major comments (2)
- [Method / Training data synthesis] Training-data synthesis (described in the abstract and method): the procedure generates all supervision from the same foundation model via language switches followed by face/audio inpainting. For the performance claims to hold, these pairs must supply clean, unbiased targets that improve phoneme-to-viseme mapping and motion continuity; no quantitative comparison against real multilingual ground-truth recordings or ablation isolating inpainting artifacts is reported, leaving open the possibility that the LoRA merely reproduces base-model failure modes.
- [Results / Experiments] Evaluation (abstract and results): the claim of improved visual fidelity, lip synchronization, and robustness is stated without reference to specific metrics, baseline implementations, dataset sizes, or statistical tests. Because the only adaptation step is the LoRA trained on synthetic pairs, any unmeasured bias in the synthetic distribution directly limits the validity of the cross-method comparison.
minor comments (2)
- [Method] Specify the precise conditioning inputs to the LoRA (e.g., exact audio and visual feature channels) and the chosen LoRA rank and training hyperparameters; a configuration sketch follows this list.
- [Discussion] Add a short discussion of failure cases, such as extreme head motion or background audio interference, to clarify the robustness claim.
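As an illustration of the disclosure the first minor comment requests, a LoRA specification in the style of the peft library might look like this. The rank, scaling, and target-module names are assumptions for the sketch, not the paper's settings.

```python
from peft import LoraConfig

# Illustrative only: the rank, scaling, and target module names are assumptions,
# not the paper's reported hyperparameters.
lora_config = LoraConfig(
    r=16,                      # low-rank dimension
    lora_alpha=16,             # scaling factor applied to the low-rank update
    lora_dropout=0.0,
    target_modules=["to_q", "to_k", "to_v", "to_out.0"],  # attention projections
)
```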
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We agree that additional validation of the synthetic training data and more rigorous reporting of experimental details are necessary to strengthen the claims. The revised manuscript will incorporate clarifications, new ablations, and expanded evaluation sections to address these points directly.
Point-by-point responses
Referee: [Method / Training data synthesis] Training-data synthesis (described in the abstract and method): the procedure generates all supervision from the same foundation model via language switches followed by face/audio inpainting. For the performance claims to hold, these pairs must supply clean, unbiased targets that improve phoneme-to-viseme mapping and motion continuity; no quantitative comparison against real multilingual ground-truth recordings or ablation isolating inpainting artifacts is reported, leaving open the possibility that the LoRA merely reproduces base-model failure modes.
Authors: We acknowledge the importance of verifying that the synthetic pairs provide unbiased supervision. The intra-clip language-switch strategy ensures that the two halves share identical speaker identity, pose, and background, with only language-specific audio and lip motion differing; the subsequent inpainting step then uses the base model's joint audio-visual prior to generate the target half. This construction is intended to avoid the domain gaps that arise when mixing real recordings from different sources. In the revision we will add (i) a quantitative assessment of the synthetic pairs against a small set of available real multilingual clips using lip-sync error (LSE) and face-identity cosine similarity, and (ii) an ablation that trains the LoRA on pairs generated with and without the inpainting stage, reporting the resulting differences in downstream dubbing metrics.
Revision status: partial
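For concreteness, the two promised measurements could be scored as in the sketch below. The embedding extractors are assumed to be pretrained face-recognition and SyncNet-style encoders, and the simplified LSE variant here omits the temporal-offset search of the full metric.

```python
import numpy as np

def identity_cosine_similarity(src_face_emb: np.ndarray, dub_face_emb: np.ndarray) -> float:
    """Cosine similarity between face-recognition embeddings of the source and
    dubbed speaker; values near 1.0 indicate preserved identity."""
    a = src_face_emb / np.linalg.norm(src_face_emb)
    b = dub_face_emb / np.linalg.norm(dub_face_emb)
    return float(a @ b)

def lse_distance(audio_emb: np.ndarray, lip_emb: np.ndarray) -> float:
    """Simplified SyncNet-style lip-sync error: mean Euclidean distance between
    per-window audio and lip embeddings (lower is better). Both arrays are
    (num_windows, dim) and assumed to come from a pretrained sync encoder;
    the full metric additionally searches over temporal offsets."""
    return float(np.mean(np.linalg.norm(audio_emb - lip_emb, axis=1)))
```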
Referee: [Results / Experiments] Evaluation (abstract and results): the claim of improved visual fidelity, lip synchronization, and robustness is stated without reference to specific metrics, baseline implementations, dataset sizes, or statistical tests. Because the only adaptation step is the LoRA trained on synthetic pairs, any unmeasured bias in the synthetic distribution directly limits the validity of the cross-method comparison.
Authors: We apologize for the insufficient detail in the original submission. The revised version will explicitly list all evaluation metrics (FID for visual quality, LSE for lip synchronization, and mean opinion scores from a user study for perceived robustness), provide implementation details and citations for every baseline, state the exact number of test videos and speakers used, and include statistical significance tests (paired t-tests with p-values) for all reported improvements. We will also add a short discussion of possible synthetic-data biases and how the joint diffusion prior helps mitigate them relative to cascaded pipelines.
Revision status: yes
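The promised significance test is standard; a minimal paired t-test over per-video scores, using scipy's ttest_rel, could look like this (the numbers are illustrative, not the paper's results).

```python
import numpy as np
from scipy.stats import ttest_rel

# Per-video lip-sync scores for the same test videos under two systems
# (illustrative numbers, not the paper's results).
ours = np.array([7.1, 6.8, 7.5, 6.9, 7.3])
baseline = np.array([6.2, 6.5, 6.9, 6.1, 6.8])

stat, p_value = ttest_rel(ours, baseline)  # paired: same videos under both systems
print(f"paired t = {stat:.3f}, p = {p_value:.4f}")
```

Pairing on the same test videos controls for per-video difficulty, which matters when the test set is small.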
Circularity Check
No significant circularity detected
Full rationale
The paper adapts an external pre-trained audio-visual diffusion foundation model via lightweight LoRA fine-tuning. Training data consists of synthetic multilingual pairs generated by the same base model through language switches and inpainting, but the central performance claims (improved visual fidelity, lip synchronization, and robustness) are positioned as empirical outcomes measured against existing external dubbing pipelines rather than quantities defined by construction from the training pairs or any fitted parameters. No equations, self-citations, uniqueness theorems, or ansatzes appear in the abstract or description that reduce the claimed results to the inputs by definition; the method remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
free parameters (1)
- LoRA rank and training hyperparameters
axioms (1)
- Domain assumption: Audio-visual foundation models have strong priors for generating synchronized sound and visuals that transfer to dubbing tasks.