AV-DiT: Efficient audio-visual diffusion transformer for joint audio and video generation
8 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
roles: background 3
citation-polarity summary
polarities: background 3
representative citing papers
MSAVBench is the first comprehensive benchmark for multi-shot audio-video generation featuring four dimensions, challenging scenarios, and an adaptive hybrid evaluation framework that achieves 91.5% Spearman correlation with human judgments.
MTSS replaces monolithic video captions with factorized streams and relational grounding, yielding reported gains in understanding benchmarks and generation consistency.
PhyAVBench provides the first systematic benchmark and metric for audio-physics grounding in T2AV, I2AV, and V2A models using controlled prompt pairs and real video ground truth.
VideoASMR-Bench shows that state-of-the-art VLMs fail to reliably distinguish AI-generated ASMR videos from real ones, though humans can still identify the fakes relatively easily.
SyncDPO improves temporal synchronization in video-audio joint generation using DPO with efficient on-the-fly negative sample construction and curriculum learning.
AdaCluster delivers a training-free adaptive query-key clustering framework for sparse attention in video DiTs, yielding 1.67-4.31x inference speedup with negligible quality loss on CogVideoX-2B, HunyuanVideo, and Wan-2.1.
citing papers explorer
- Inference-Time Scaling for Joint Audio-Video Generation
Presents a multi-verifier framework and Adaptive Reward Weighting (ARW) for inference-time scaling in joint audio-video generation, reporting gains in alignment and synchronization on VGGSound and JavisBench-mini.