STARSS23: An Audio-Visual Dataset of Spatial Recordings of Real Scenes with Spatiotemporal Annotations of Sound Events

Aapo Hakala; Archontis Politis; Daniel Krause; Kazuki Shimada; Kengo Uchida; Naoya Takahashi; Parthasaarathy Sudarsanam; Sharath Adavanne; Shusuke Takahashi; Tuomas Virtanen

arXiv: 2306.09126 · v2 · pith:ZHHWHLOKnew · submitted 2023-06-15 · 💻 cs.SD · cs.CV · cs.MM · eess.AS · eess.IV

Kazuki Shimada, Archontis Politis, Parthasaarathy Sudarsanam, Daniel Krause, Kengo Uchida, Sharath Adavanne, Aapo Hakala, Yuichiro Koyama, Naoya Takahashi, Shusuke Takahashi, Tuomas Virtanen, Yuki Mitsufuji

classification 💻 cs.SD · cs.CV · cs.MM · eess.AS · eess.IV

keywords sound · events · audio-visual · data · starss23 · array · audio · microphone

While direction of arrival (DOA) of sound events is generally estimated from multichannel audio data recorded with a microphone array, sound events usually derive from visually perceptible source objects, e.g., sounds of footsteps come from the feet of a walker. This paper proposes an audio-visual sound event localization and detection (SELD) task, which uses multichannel audio and video information to estimate the temporal activation and DOA of target sound events. Audio-visual SELD systems can detect and localize sound events using signals from a microphone array and audio-visual correspondence. We also introduce an audio-visual dataset, Sony-TAu Realistic Spatial Soundscapes 2023 (STARSS23), which consists of multichannel audio data recorded with a microphone array, video data, and spatiotemporal annotation of sound events. Sound scenes in STARSS23 are recorded with instructions, which guide recording participants to ensure adequate activity and occurrences of sound events. STARSS23 also provides human-annotated temporal activation labels and human-confirmed DOA labels, which are based on tracking results of a motion capture system. Our benchmark results demonstrate the benefits of using visual object positions in audio-visual SELD tasks. The data is available at https://zenodo.org/record/7880637.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

The World is Not Mono: Enabling Spatial Understanding in Large Audio-Language Models
cs.SD 2026-01 unverdicted novelty 7.0

TWNM framework equips audio-language models with spatial scene analysis via FOA simulation and metadata-grounded training, reaching 70.8% accuracy on a new ASA benchmark.