VGGSounder: Audio-Visual Evaluations for Foundation Models

Ameya Prabhu; A. Sophia Koepke; Daniil Zverev; Matthias Bethge; Thaddäus Wiedemer; Wieland Brendel

arxiv: 2508.08237 · v4 · pith:UEFX2XFQnew · submitted 2025-08-11 · 💻 cs.MM · cs.AI · cs.CV · cs.SD · eess.AS

Daniil Zverev, Thaddäus Wiedemer, Ameya Prabhu, Matthias Bethge, Wieland Brendel, A. Sophia Koepke

classification 💻 cs.MM cs.AI cs.CV cs.SD eess.AS

keywords audio-visual, foundation, limitations, modality, models, vggsound, vggsounder, evaluations

The emergence of audio-visual foundation models underscores the importance of reliably assessing their multi-modal understanding. The VGGSound dataset is commonly used as a benchmark for evaluating audio-visual classification. However, our analysis identifies several limitations of VGGSound, including incomplete labelling, partially overlapping classes, and misaligned modalities. These lead to distorted evaluations of auditory and visual capabilities. To address these limitations, we introduce VGGSounder, a comprehensively re-annotated, multi-label test set that extends VGGSound and is specifically designed to evaluate audio-visual foundation models. VGGSounder features detailed modality annotations, enabling precise analyses of modality-specific performance. Furthermore, we reveal model limitations by analysing performance degradation when adding another input modality with our new modality confusion metric.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Omni2Sound: Towards Unified Video-Text-to-Audio Generation
cs.SD 2026-01 unverdicted novelty 7.0

A single DiT-based diffusion model unifies video-to-audio, text-to-audio, and joint video-text-to-audio generation, supported by a new 470k-pair dataset and three-stage progressive training that resolves task competition.