Training-Free Multimodal Guidance for Video to Audio Generation

2025 · cs.LG · arXiv 2509.24550

1 Pith paper cites this work. Polarity classification is still indexing.

abstract

Video-to-audio (V2A) generation aims to synthesize realistic and semantically aligned audio from silent videos, with potential applications in video editing, Foley sound design, and assistive multimedia. Despite their excellent results, existing approaches either require costly joint training on large-scale paired datasets or rely on pairwise similarities that may fail to capture global multimodal coherence. In this work, we propose a novel training-free multimodal guidance mechanism for V2A diffusion that leverages the volume spanned by the modality embeddings to enforce unified alignment across video, audio, and text. The proposed multimodal diffusion guidance (MDG) provides a lightweight, plug-and-play control signal that can be applied on top of any pretrained audio diffusion model without retraining. Experiments on VGGSound and AudioCaps show that MDG consistently improves perceptual quality and multimodal alignment compared to baselines, demonstrating the effectiveness of joint multimodal guidance for V2A.
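
To make the "volume spanned by the modality embeddings" idea concrete, here is a minimal sketch of one way such a volume could be measured. The abstract gives no formula, so everything here is an assumption: the use of the Gram determinant on unit-normalized vectors and the function name `spanned_volume` are illustrative and not taken from the paper. The sketch only computes the scalar score; it does not implement the diffusion guidance step itself.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


def spanned_volume(video_emb, audio_emb, text_emb):
    """Volume of the parallelotope spanned by three unit-normalized embeddings.

    Assumption (not from the paper): the volume is the square root of the
    determinant of the 3x3 Gram matrix of pairwise dot products. A smaller
    volume means the three embeddings point in more similar directions.
    """
    vecs = [normalize(v) for v in (video_emb, audio_emb, text_emb)]
    gram = [[dot(u, w) for w in vecs] for u in vecs]
    # Clamp at zero to guard against tiny negative values from rounding.
    return math.sqrt(max(det3(gram), 0.0))
```

Under this assumed definition, three mutually orthogonal embeddings give a volume of 1, and three collinear embeddings give a volume of approximately 0. A single score of this kind depends on all three modalities at once, which is the contrast the abstract draws with pairwise similarities.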

representative citing papers

Training-Free Multimodal Guidance for Video to Audio Generation

cs.LG · 2025-09-29 · unverdicted · novelty 4.0

Proposes a plug-and-play multimodal diffusion guidance mechanism that improves video-to-audio generation quality and alignment by enforcing unified multimodal coherence on pretrained audio diffusion models.
