
HunyuanVideo-Foley: Multimodal diffusion with representation alignment for high-fidelity foley audio generation

2025 · arXiv 2508.16930

13 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

TMD-Bench: A Multi-Level Evaluation Paradigm for Music-Dance Co-Generation

cs.SD · 2026-05-03 · unverdicted · novelty 7.0

TMD-Bench is a multi-level benchmark that measures music-dance co-generation quality including beat-level rhythmic synchronization, supported by a new dataset and Music Captioner, and shows commercial models lag in rhythm while a new baseline performs competitively.

VidAudio-Bench: Benchmarking V2A and VT2A Generation across Four Audio Categories

cs.SD · 2026-04-12 · unverdicted · novelty 7.0

VidAudio-Bench benchmarks V2A and VT2A models across four audio categories, revealing poor speech/singing performance and a tension between visual alignment and text instruction following.

OmniSonic: Towards Universal and Holistic Audio Generation from Video and Text

cs.SD · 2026-04-06 · unverdicted · novelty 7.0

OmniSonic introduces a TriAttn-DiT architecture with MoE gating to jointly generate on-screen, off-screen, and speech audio from video and text, outperforming prior models on a new UniHAGen-Bench.

Omni2Sound: Towards Unified Video-Text-to-Audio Generation

cs.SD · 2026-01-06 · unverdicted · novelty 7.0

A single DiT-based diffusion model unifies video-to-audio, text-to-audio, and joint video-text-to-audio generation, supported by a new 470k-pair dataset and three-stage progressive training that resolves task competition.

PhyAVBench: A Challenging Audio Physics-Sensitivity Benchmark for Physically Grounded Text-to-Audio-Video Generation

cs.SD · 2025-12-30 · unverdicted · novelty 7.0 · 2 refs

PhyAVBench provides the first systematic benchmark and metric for audio-physics grounding in T2AV, I2AV, and V2A models using controlled prompt pairs and real video ground truth.

AVI-Edit: Audio-sync Video Instance Editing with Granularity-Aware Mask Refiner

cs.CV · 2025-12-11 · unverdicted · novelty 7.0

AVI-Edit enables precise audio-synchronized instance-level video editing via a granularity-aware mask refiner, a self-feedback audio agent, and a new large-scale annotated dataset.

VABench: A Comprehensive Benchmark for Audio-Video Generation

cs.CV · 2025-12-10 · unverdicted · novelty 7.0

VABench is a new multi-dimensional benchmark for evaluating synchronous audio-video generation across text-to-AV, image-to-AV, and stereo tasks.

MVAD: A Benchmark Dataset for Multimodal AI-Generated Video-Audio Detection

cs.CV · 2025-11-29 · conditional · novelty 7.0

MVAD is the first comprehensive benchmark dataset for AI-generated multimodal video-audio detection, with three realistic forgery patterns, high-quality outputs from state-of-the-art models, and diversity across visual styles and content categories.

WavFlow: Audio Generation in Waveform Space

cs.SD · 2026-05-18 · conditional · novelty 6.0

WavFlow performs direct waveform audio generation via flow matching on 2D token grids from raw patches plus amplitude lifting, matching latent-based methods on VGGSound and AudioCaps without intermediate compression.

ControlFoley: Unified and Controllable Video-to-Audio Generation with Cross-Modal Conflict Handling

cs.MM · 2026-04-16 · unverdicted · novelty 6.0

ControlFoley introduces a unified framework for controllable video-to-audio generation using joint visual encoding, temporal-timbre decoupling, and robust multimodal training to handle cross-modal conflicts.

Omni-Customizer: End-to-End MultiModal Customization for Joint Audio-Video Generation

cs.CV · 2026-05-17 · unverdicted · novelty 5.0

Omni-Customizer proposes an end-to-end framework using Omni-Context Fusion, Masked TTS Cross-Attention, Semantic-Anchored Multimodal RoPE, and specialized training curricula to achieve precise multimodal identity binding in joint audio-video generation.

Tora3: Trajectory-Guided Audio-Video Generation with Physical Coherence

cs.CV · 2026-04-10 · unverdicted · novelty 5.0

Tora3 uses shared object trajectories as kinematic priors to jointly guide visual motion and acoustic events in audio-video generation, improving realism and synchronization.

CounterFlow: A Two-Phase Inference-Time Sampling for Counterfactual Video Foley Generation

cs.MM · 2026-05-18

citing papers explorer

Showing 13 of 13 citing papers.

TMD-Bench: A Multi-Level Evaluation Paradigm for Music-Dance Co-Generation · cs.SD · 2026-05-03 · unverdicted · none · ref 14
VidAudio-Bench: Benchmarking V2A and VT2A Generation across Four Audio Categories · cs.SD · 2026-04-12 · unverdicted · none · ref 42
OmniSonic: Towards Universal and Holistic Audio Generation from Video and Text · cs.SD · 2026-04-06 · unverdicted · none · ref 39
Omni2Sound: Towards Unified Video-Text-to-Audio Generation · cs.SD · 2026-01-06 · unverdicted · none · ref 11
PhyAVBench: A Challenging Audio Physics-Sensitivity Benchmark for Physically Grounded Text-to-Audio-Video Generation · cs.SD · 2025-12-30 · unverdicted · none · ref 32 · 2 links
AVI-Edit: Audio-sync Video Instance Editing with Granularity-Aware Mask Refiner · cs.CV · 2025-12-11 · unverdicted · none · ref 54
VABench: A Comprehensive Benchmark for Audio-Video Generation · cs.CV · 2025-12-10 · unverdicted · none · ref 39
MVAD: A Benchmark Dataset for Multimodal AI-Generated Video-Audio Detection · cs.CV · 2025-11-29 · conditional · none · ref 42
WavFlow: Audio Generation in Waveform Space · cs.SD · 2026-05-18 · conditional · none · ref 18
ControlFoley: Unified and Controllable Video-to-Audio Generation with Cross-Modal Conflict Handling · cs.MM · 2026-04-16 · unverdicted · none · ref 39
Omni-Customizer: End-to-End MultiModal Customization for Joint Audio-Video Generation · cs.CV · 2026-05-17 · unverdicted · none · ref 49
Tora3: Trajectory-Guided Audio-Video Generation with Physical Coherence · cs.CV · 2026-04-10 · unverdicted · none · ref 37
CounterFlow: A Two-Phase Inference-Time Sampling for Counterfactual Video Foley Generation · cs.MM · 2026-05-18 · unreviewed · ref 12
