Pith · machine review for the scientific record

arxiv: 2510.01284 · v1 · submitted 2025-09-30 · 💻 cs.MM · cs.CV · cs.SD · eess.AS

Recognition: 2 Lean theorem links

Ovi: Twin Backbone Cross-Modal Fusion for Audio-Video Generation

Authors on Pith · no claims yet

Pith reviewed 2026-05-17 00:58 UTC · model grok-4.3

classification 💻 cs.MM · cs.CV · cs.SD · eess.AS
keywords audio-video generation · cross-modal fusion · twin backbone · diffusion transformer · multimodal synchronization · unified generative model

The pith

Ovi generates audio and video together in one process by fusing twin identical diffusion transformers block by block.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Ovi as a single generative model that produces both video and its matching audio, instead of handling them through separate stages or post-hoc fixes. It runs two copies of the same transformer architecture side by side, one per modality, and lets them exchange timing cues and semantic content at every processing step. The audio copy starts from the exact structure of a pretrained video model and learns to produce realistic sound effects and expressive speech by training from scratch on large amounts of raw audio; the two towers are then trained jointly on a large video corpus. This setup yields movie-style clips where voices and sound effects line up naturally with the visuals. A reader would care if the method proves simpler and more consistent than current multi-pipeline approaches to audiovisual creation.

Core claim

Ovi treats audio-video generation as one unified process by initializing an audio tower with an architecture identical to that of a strong pretrained video model, then jointly training both towers with blockwise bidirectional cross-attention for semantics and scaled-RoPE embeddings for timing on a large video corpus, thereby achieving natural synchronization without separate pipelines or post-hoc alignment.

What carries the argument

Blockwise cross-modal fusion between twin-DiT modules that exchange timing via scaled-RoPE embeddings and semantics via bidirectional cross-attention.
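
As a rough illustration of that mechanism, the sketch below shows what one fused block could look like: each tower runs its own self-attention and MLP, and the two exchange information through bidirectional cross-attention. This is a minimal sketch assuming a standard pre-norm transformer block in PyTorch; the class name `TwinFusionBlock`, the layer layout, and the dimensions are illustrative assumptions, not the paper's implementation, and the scaled-RoPE timing embeddings are omitted for brevity.

```python
import torch
import torch.nn as nn

class TwinFusionBlock(nn.Module):
    """One illustrative twin-DiT block. Each tower (video, audio) applies its own
    self-attention and MLP; the towers then exchange semantics through bidirectional
    cross-attention. Layer layout and names are assumptions, not the paper's code."""

    def __init__(self, dim: int = 1024, n_heads: int = 16):
        super().__init__()
        self.video_self = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.audio_self = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        # Bidirectional cross-attention: video queries read audio tokens, and vice versa.
        self.video_from_audio = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.audio_from_video = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.video_mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.audio_mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norms = nn.ModuleList(nn.LayerNorm(dim) for _ in range(6))

    def forward(self, v: torch.Tensor, a: torch.Tensor):
        # v: (B, T_video, dim) video tokens; a: (B, T_audio, dim) audio tokens.
        n = self.norms
        v = v + self.video_self(n[0](v), n[0](v), n[0](v))[0]
        a = a + self.audio_self(n[1](a), n[1](a), n[1](a))[0]
        # Blockwise semantic exchange: each modality attends to the other's tokens.
        # (Scaled-RoPE timing embeddings would be applied inside these attention calls.)
        v_norm, a_norm = n[2](v), n[3](a)
        v = v + self.video_from_audio(v_norm, a_norm, a_norm)[0]
        a = a + self.audio_from_video(a_norm, v_norm, v_norm)[0]
        v = v + self.video_mlp(n[4](v))
        a = a + self.audio_mlp(n[5](a))
        return v, a
```

Running this exchange in every block, rather than fusing only at the input or output, is what would let timing and semantics align progressively as the two towers denoise in parallel.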

If this is right

  • Natural synchronization emerges directly from the joint training rather than from later corrections.
  • Separate audio and video synthesis pipelines become unnecessary.
  • The model can produce realistic sound effects along with speech that carries speaker identity and emotion.
  • Cinematic storytelling clips can be generated from the single unified architecture.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Sharing the exact same backbone structure between modalities may allow visual pretraining to transfer more directly to audio generation than independent training would.
  • The blockwise fusion pattern could be tested on longer sequences to check whether alignment remains stable over extended durations.
  • If the approach generalizes, similar twin setups might simplify adding further modalities such as text or depth without building new alignment stages.
  • Production workflows for film or games could reduce manual sound design steps if the generated effects match scenes reliably.

Load-bearing premise

Initializing an audio tower with the exact same architecture as a pretrained video model and then training the pair jointly, with blockwise bidirectional cross-attention for semantics and scaled-RoPE embeddings for timing, will produce reliable cross-modal alignment without extra alignment losses or post-processing.

What would settle it

A test set of videos with rapid motion or overlapping sounds where the generated audio and video show clear timing offsets or mismatched content when inspected frame by frame.
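
One way such a check could be made quantitative rather than purely frame-by-frame is sketched below: cross-correlate a per-frame audio onset envelope with a per-frame visual motion-energy signal and read off the lag at the correlation peak. This is a hedged sketch, not a metric from the paper; it assumes NumPy and pre-extracted per-frame features, and the name `estimate_av_offset` is illustrative.

```python
import numpy as np

def estimate_av_offset(audio_onsets: np.ndarray, motion_energy: np.ndarray, fps: float) -> float:
    """Estimate the audio-video lag (seconds) that maximizes the normalized
    cross-correlation between a per-frame audio onset envelope and a per-frame
    visual motion-energy signal. Both inputs are 1-D arrays sampled at `fps`."""
    a = (audio_onsets - audio_onsets.mean()) / (audio_onsets.std() + 1e-8)
    m = (motion_energy - motion_energy.mean()) / (motion_energy.std() + 1e-8)
    corr = np.correlate(a, m, mode="full")        # lags from -(len(m)-1) to +(len(a)-1) frames
    lags = np.arange(-len(m) + 1, len(a))
    return float(lags[np.argmax(corr)] / fps)     # positive => audio lags video

# Synthetic example: a visual burst at frame 40 and the matching sound at frame 43.
fps = 24.0
motion = np.zeros(120); motion[40] = 1.0
onsets = np.zeros(120); onsets[43] = 1.0
print(estimate_av_offset(onsets, motion, fps))    # ~ +0.125 s (audio 3 frames late)
```

A distribution of such offsets over the proposed test set, alongside a lip-sync detector on speech clips, would directly support or undermine the synchronization claim.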

read the original abstract

Audio-video generation has often relied on complex multi-stage architectures or sequential synthesis of sound and visuals. We introduce Ovi, a unified paradigm for audio-video generation that models the two modalities as a single generative process. By using blockwise cross-modal fusion of twin-DiT modules, Ovi achieves natural synchronization and removes the need for separate pipelines or post hoc alignment. To facilitate fine-grained multimodal fusion modeling, we initialize an audio tower with an architecture identical to that of a strong pretrained video model. Trained from scratch on hundreds of thousands of hours of raw audio, the audio tower learns to generate realistic sound effects, as well as speech that conveys rich speaker identity and emotion. Fusion is obtained by jointly training the identical video and audio towers via blockwise exchange of timing (via scaled-RoPE embeddings) and semantics (through bidirectional cross-attention) on a vast video corpus. Our model enables cinematic storytelling with natural speech and accurate, context-matched sound effects, producing movie-grade video clips. All the demos, code and model weights are published at https://aaxwaz.github.io/Ovi

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated author's rebuttal, circularity audit, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Ovi, a unified model for audio-video generation that treats both modalities as a single generative process. It uses twin DiT backbones with blockwise cross-modal fusion: bidirectional cross-attention exchanges semantics while scaled-RoPE embeddings handle timing. An audio tower is initialized with the identical architecture to a strong pretrained video model and the pair is jointly trained on a large video corpus. The central claim is that this architecture produces movie-grade clips with natural synchronization, rich speaker identity/emotion in speech, and context-matched sound effects, while eliminating separate pipelines or post-hoc alignment. Code, model weights, and demos are released.

Significance. If the result holds, the work would be significant as a simplification of audio-visual generative architectures, showing that identical twin towers plus cross-attention and scaled positional encodings can learn precise cross-modal alignment from joint training on raw video data alone. The explicit release of code, weights, and demos is a clear strength that supports reproducibility and follow-on research.

major comments (2)
  1. [Abstract] Abstract: the claim that the model 'achieves natural synchronization and removes the need for separate pipelines or post hoc alignment' is load-bearing for the central contribution yet rests entirely on qualitative demonstration; no quantitative metrics (e.g., lip-sync error, AV correlation, or comparison to baselines), ablation studies, or error analysis are reported, so the strength of the claim cannot be verified.
  2. [Method] Method (blockwise fusion and scaled-RoPE description): the assumption that bidirectional cross-attention plus scaled-RoPE will automatically enforce fine-grained temporal phase locking between mismatched audio and video sampling rates, without auxiliary alignment losses or post-processing, is not supported by any analysis or ablation; this directly underpins the claim that no separate alignment stage is required.
minor comments (2)
  1. [Abstract] Abstract: the training description states the audio tower is 'trained from scratch on hundreds of thousands of hours of raw audio' while fusion occurs 'on a vast video corpus'; clarify whether the audio tower receives additional audio-only pretraining or only joint video data.
  2. [Method] Notation: the exact formulation of the blockwise exchange (which layers, how queries/keys/values are projected across modalities) would benefit from an equation or pseudocode block for reproducibility.
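
On the referee's points about the blockwise exchange and timing across mismatched token rates, one way such a formulation might look is sketched below: rotary-embedding angles are indexed by absolute time in seconds rather than by token position, so audio and video tokens that cover the same instant receive the same rotation. This is a hedged illustration of the general idea, not the paper's scaled-RoPE definition; the function name `scaled_rope_angles` and the token rates are hypothetical.

```python
import torch

def scaled_rope_angles(n_tokens: int, tokens_per_second: float, dim: int, base: float = 10000.0) -> torch.Tensor:
    """Rotary-embedding angles indexed by absolute time (seconds) instead of token
    index, so token streams with different rates share one time axis. Illustrative
    only; this is not the paper's exact scaled-RoPE formulation."""
    t = torch.arange(n_tokens, dtype=torch.float32) / tokens_per_second   # token times in seconds
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
    return torch.outer(t, inv_freq)   # (n_tokens, dim // 2) rotation angles

# Hypothetical rates: 5 s of video at 24 tokens/s and audio at 50 tokens/s.
video_angles = scaled_rope_angles(120, tokens_per_second=24.0, dim=64)
audio_angles = scaled_rope_angles(250, tokens_per_second=50.0, dim=64)
# Video token 24 and audio token 50 both sit at t = 1.0 s and receive the same angles,
# giving cross-attention a shared notion of time despite the rate mismatch.
```

An ablation of the kind the referee requests could then compare this time-indexed variant against plain position-indexed RoPE to isolate how much of the reported synchronization the scaling actually carries.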

Simulated Author's Rebuttal

2 responses · 0 unresolved

We are grateful to the referee for the constructive feedback on our work. Below we respond to each of the major comments in turn. We have prepared revisions to the manuscript to incorporate additional quantitative evaluations and analyses as suggested.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the claim that the model 'achieves natural synchronization and removes the need for separate pipelines or post hoc alignment' is load-bearing for the central contribution yet rests entirely on qualitative demonstration; no quantitative metrics (e.g., lip-sync error, AV correlation, or comparison to baselines), ablation studies, or error analysis are reported, so the strength of the claim cannot be verified.

    Authors: We acknowledge that the central claim in the abstract currently rests on qualitative demonstrations and released demos. To strengthen verifiability, we will add a dedicated quantitative evaluation subsection reporting lip-sync error (via SyncNet), audio-visual correlation scores, and direct comparisons against relevant baselines. These results, along with a brief error analysis, will be included in the revised manuscript. revision: yes

  2. Referee: [Method] Method (blockwise fusion and scaled-RoPE description): the assumption that bidirectional cross-attention plus scaled-RoPE will automatically enforce fine-grained temporal phase locking between mismatched audio and video sampling rates, without auxiliary alignment losses or post-processing, is not supported by any analysis or ablation; this directly underpins the claim that no separate alignment stage is required.

    Authors: The referee is correct that the method section lacks explicit analysis or ablation supporting the temporal alignment properties of blockwise cross-attention and scaled-RoPE. We will revise the method description to include an ablation study isolating these components and a short discussion of how joint training on raw video data enables implicit learning of phase locking without auxiliary losses or post-processing. revision: yes

Circularity Check

0 steps flagged

No circularity: synchronization emerges from the described joint training procedure

full rationale

The paper presents Ovi as a unified generative model using twin-DiT towers with blockwise bidirectional cross-attention for semantics and scaled-RoPE for timing, trained jointly on a video corpus. The claimed natural synchronization is positioned as an outcome of this standard joint optimization rather than a fitted hyperparameter or self-referential definition. No equations reduce reported alignment performance to inputs by construction, and no load-bearing self-citations or uniqueness theorems are invoked in the provided text. The derivation chain remains self-contained as an empirical training claim.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the assumption that identical DiT architectures plus blockwise cross-attention suffice for synchronization; no new physical entities are introduced.

axioms (2)
  • domain assumption Diffusion transformers can be trained from scratch on raw audio to produce realistic sound effects and emotional speech when given sufficient data.
    Invoked in the description of audio-tower pre-training on hundreds of thousands of hours of raw audio.
  • ad hoc to paper Bidirectional cross-attention between identical video and audio towers will transfer timing and semantic information without additional alignment objectives.
    Central to the blockwise fusion claim; no separate loss term is mentioned.

pith-pipeline@v0.9.0 · 5502 in / 1320 out tokens · 61104 ms · 2026-05-17T00:58:33.434002+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 17 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. PhyAVBench: A Challenging Audio Physics-Sensitivity Benchmark for Physically Grounded Text-to-Audio-Video Generation

    cs.SD 2025-12 accept novelty 8.0

    PhyAVBench supplies the first benchmark and contrastive metric that measures whether text-to-audio-video models respect real-world audio physics across controlled prompt pairs.

  2. CaC: Advancing Video Reward Models via Hierarchical Spatiotemporal Concentrating

    cs.CV 2026-05 unverdicted novelty 7.0

    CaC is a hierarchical spatiotemporal concentrating reward model for video anomalies that reports 25.7% accuracy gains on fine-grained benchmarks and 11.7% anomaly reduction in generated videos via a new dataset and GR...

  3. Do Joint Audio-Video Generation Models Understand Physics?

    cs.SD 2026-05 unverdicted novelty 7.0

    Current joint audio-video generation models lack robust physical commonsense, especially during transitions and when prompted for impossible behaviors.

  4. TMD-Bench: A Multi-Level Evaluation Paradigm for Music-Dance Co-Generation

    cs.SD 2026-05 unverdicted novelty 7.0

    TMD-Bench is a multi-level benchmark that measures music-dance co-generation quality including beat-level rhythmic synchronization, supported by a new dataset and Music Captioner, and shows commercial models lag in rh...

  5. Hallo-Live: Real-Time Streaming Joint Audio-Video Avatar Generation with Asynchronous Dual-Stream and Human-Centric Preference Distillation

    cs.CV 2026-04 unverdicted novelty 7.0

    Hallo-Live achieves 20.38 FPS real-time text-to-audio-video avatar generation with 0.94s latency using asynchronous dual-stream diffusion and HP-DMD preference distillation, matching teacher model quality at 16x highe...

  6. Talker-T2AV: Joint Talking Audio-Video Generation with Autoregressive Diffusion Modeling

    cs.CV 2026-04 unverdicted novelty 7.0

    Talker-T2AV achieves better lip-sync accuracy, video quality, and audio quality than dual-branch baselines by separating high-level shared autoregressive modeling from modality-specific low-level diffusion refinement ...

  7. Beyond Monologue: Interactive Talking-Listening Avatar Generation with Conversational Audio Context-Aware Kernels

    cs.AI 2026-04 unverdicted novelty 7.0

    Multi-head Gaussian kernels inject temporal scale discrepancy as inductive bias to enable full-duplex talking-listening avatar generation, supported by a new decoupled VoxHear dataset and claimed SOTA naturalness.

  8. JUST-DUB-IT: Video Dubbing via Joint Audio-Visual Diffusion

    cs.GR 2026-01 unverdicted novelty 7.0

    JUST-DUB-IT adapts a joint audio-visual diffusion model via LoRA to generate high-quality dubbed videos with translated audio and lip-synced facial motion.

  9. AVI-Edit: Audio-sync Video Instance Editing with Granularity-Aware Mask Refiner

    cs.CV 2025-12 unverdicted novelty 7.0

    AVI-Edit enables precise audio-synchronized instance-level video editing via a granularity-aware mask refiner, a self-feedback audio agent, and a new large-scale annotated dataset.

  10. SyncDPO: Enhancing Temporal Synchronization in Video-Audio Joint Generation via Preference Learning

    cs.CV 2026-05 unverdicted novelty 6.0

    SyncDPO improves temporal synchronization in video-audio joint generation using DPO with efficient on-the-fly negative sample construction and curriculum learning.

  11. Unison: Harmonizing Motion, Speech, and Sound for Human-Centric Audio-Video Generation

    cs.CV 2026-05 unverdicted novelty 6.0

    Unison introduces a unified framework using semantic-guided harmonization and bidirectional cross-modal forcing to generate human-centric videos with improved synchronization between motion, speech, and sound effects.

  12. Mutual Forcing: Dual-Mode Self-Evolution for Fast Autoregressive Audio-Video Character Generation

    cs.CV 2026-04 unverdicted novelty 6.0

    Mutual Forcing trains a single native autoregressive audio-video model with mutually reinforcing few-step and multi-step modes via self-distillation to match 50-step baselines at 4-8 steps.

  13. OmniHuman: A Large-scale Dataset and Benchmark for Human-Centric Video Generation

    cs.CV 2026-04 unverdicted novelty 6.0

    OmniHuman is a new large-scale multi-scene dataset with video-, frame-, and individual-level annotations for human-centric video generation, accompanied by the OHBench benchmark that adds metrics aligned with human pe...

  14. OmniShow: Unifying Multimodal Conditions for Human-Object Interaction Video Generation

    cs.CV 2026-04 unverdicted novelty 6.0

    OmniShow unifies text, image, audio, and pose conditions into an end-to-end model for high-quality human-object interaction video generation and introduces the HOIVG-Bench benchmark, claiming state-of-the-art results.

  15. RoboStereo: Dual-Tower 4D Embodied World Models for Unified Policy Optimization

    cs.CV 2026-03 unverdicted novelty 6.0

    A dual-tower 4D embodied world model called RoboStereo reduces geometric hallucinations and delivers over 97% relative improvement on manipulation tasks via test-time augmentation, imitative learning, and open exploration.

  16. Tora3: Trajectory-Guided Audio-Video Generation with Physical Coherence

    cs.CV 2026-04 unverdicted novelty 5.0

    Tora3 uses shared object trajectories as kinematic priors to jointly guide visual motion and acoustic events in audio-video generation, improving realism and synchronization.

  17. LTX-2: Efficient Joint Audio-Visual Foundation Model

    cs.CV 2026-01 conditional novelty 5.0

    LTX-2 generates high-quality synchronized audiovisual content from text prompts via an asymmetric 14B-video / 5B-audio dual-stream transformer with cross-attention and modality-aware guidance.

Reference graph

Works this paper leans on

20 extracted references · 20 canonical work pages · cited by 17 Pith papers · 6 internal anchors

  1. [1]

    Seed-TTS: A Family of High-Quality Versatile Speech Generation Models

    Philip Anastassiou, Jiawei Chen, Jitong Chen, Yuanzhe Chen, Zhuo Chen, Ziyi Chen, Jian Cong, Lelai Deng, Chuang Ding, Lu Gao, et al. arXiv preprint arXiv:2406.02430, 2024.

  2. [2]

    Video Generation Models as World Simulators

    OpenAI. URL https://openai.com/research/video-generation-models-as-world-simulators. Accessed: 2025-09-24.

  3. [3]

    HuMo: Human-Centric Video Generation via Collaborative Multi-Modal Conditioning

    Liyang Chen, Tianxiang Ma, Jiawei Liu, Bingchuan Li, Zhuowei Chen, Lijie Liu, Xu He, Gen Li, Qian He, and Zhiyong Wu. arXiv preprint arXiv:2509.08519, 2025.

  4. [4]

    CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens

    Zhihao Du, Qian Chen, Shiliang Zhang, Kai Hu, Heng Lu, Yexin Yang, Hangrui Hu, Siqi Zheng, Yue Gu, Ziyang Ma, et al. arXiv preprint arXiv:2407.05407, 2024.

  5. [5]

    ACE-Step: A Step Towards Music Generation Foundation Model

    Junmin Gong, Sean Zhao, Sen Wang, Shengyuan Xu, and Joe Guo. arXiv preprint arXiv:2506.00045, 2025.

  6. [6]

    Veo 3 Tech Report

    Google DeepMind. URL https://storage.googleapis.com/deepmind-media/veo/Veo-3-Tech-Report.pdf. Accessed: 2025-09-24.

  7. [7]

    Taming Data and Transformers for Audio Generation

    Moayed Haji-Ali, Willi Menapace, Aliaksandr Siarohin, Guha Balakrishnan, and Vicente Ordonez. arXiv preprint arXiv:2406.19388, 2024.

  8. [8]

    Classifier-Free Diffusion Guidance

    Jonathan Ho and Tim Salimans. arXiv preprint arXiv:2207.12598, 2022.

  9. [9]

    Make-An-Audio 2: Temporal-Enhanced Text-to-Audio Generation

    Jiawei Huang, Yi Ren, Rongjie Huang, Dongchao Yang, Zhenhui Ye, Chen Zhang, Jinglin Liu, Xiang Yin, Zejun Ma, and Zhou Zhao. arXiv preprint arXiv:2305.18474, 2023.

  10. [10]

    HunyuanVideo: A Systematic Framework For Large Video Generative Models

    Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. arXiv preprint arXiv:2412.03603, 2024.

  11. [11]

    BigVGAN: A Universal Neural Vocoder with Large-Scale Training

    Sang-gil Lee, Wei Ping, Boris Ginsburg, Bryan Catanzaro, and Sungroh Yoon. arXiv preprint arXiv:2206.04658, 2022.

  12. [12]

    Fish-Speech: Leveraging Large Language Models for Advanced Multilingual Text-to-Speech Synthesis

    Shijia Liao, Yuxuan Wang, Tianyu Li, Yifan Cheng, Ruoyi Zhang, Rongzhi Zhou, and Yijin Xing. arXiv preprint arXiv:2411.01156, 2024.

  13. [13]

    Flow Matching for Generative Modeling

    Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. arXiv preprint arXiv:2210.02747, 2022.

  14. [14]

    JavisDiT: Joint Audio-Video Diffusion Transformer with Hierarchical Spatio-Temporal Prior Synchronization

    Kai Liu, Wei Li, Lai Chen, Shengqiong Wu, Yanhao Zheng, Jiayi Ji, Fan Zhou, Rongxin Jiang, Jiebo Luo, Hao Fei, et al. arXiv preprint arXiv:2503.23377, 2025.

  15. [15]

    TalkingMachines: Real-Time Audio-Driven FaceTime-Style Video via Autoregressive Diffusion Models

    Chetwin Low and Weimin Wang. arXiv preprint arXiv:2506.03099, 2025.

  16. [16]

    Wan: Open and Advanced Large-Scale Video Generative Models

    Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. arXiv preprint arXiv:2503.20314, 2025.

  17. [17]

    UniVerse-1: Unified Audio-Video Generation via Stitching of Experts

    Duomin Wang, Wei Zuo, Aojie Li, Ling-Hao Chen, Xinyao Liao, Deyu Zhou, Zixin Yin, Xili Dai, Daxin Jiang, and Gang Yu. arXiv preprint arXiv:2509.06155, 2025.

  18. [18]

    Large-Scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation

    Yusong Wu, Ke Chen, Tianyu Zhang, Yuchen Hui, Taylor Berg-Kirkpatrick, and Shlomo Dubnov. In ICASSP 2023 - IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1–5. IEEE, 2023.

  19. [19]

    MagicInfinite: Generating Infinite Talking Videos with Your Words and Voice

    Hongwei Yi, Tian Ye, Shitong Shao, Xuancheng Yang, Jiantong Zhao, Hanzhong Guo, Terrance Wang, Qingyu Yin, Zeke Xie, Lei Zhu, et al. arXiv preprint arXiv:2503.05978, 2025.

  20. [20]

    DeepAudio-V1: Towards Multi-Modal Multi-Stage End-to-End Video to Speech and Audio Generation

    Haomin Zhang, Chang Liu, Junjie Zheng, Zihao Chen, Chaofan Ding, and Xinhan Di. arXiv preprint arXiv:2503.22265, 2025.