Stereo Sound Event Localization and Detection with Onscreen/offscreen Classification
By: Kazuki Shimada, Archontis Politis, Iran R. Roman, and more
Potential Business Impact:
Helps computers find and locate sounds in stereo recordings.
This paper presents the objective, dataset, baseline, and metrics of Task 3 of the DCASE2025 Challenge on sound event localization and detection (SELD). In previous editions, the challenge used four-channel audio formats: first-order Ambisonics (FOA) and a microphone array. In contrast, this year's challenge investigates SELD with stereo audio data (termed stereo SELD). This change shifts the focus from more specialized 360° audio and audiovisual scene analysis to more commonplace audio and media scenarios with a limited field of view (FOV). Due to inherent angular ambiguities in stereo audio, the task focuses on direction-of-arrival (DOA) estimation in the azimuth plane (left-right axis) along with distance estimation. The challenge remains divided into two tracks, audio-only and audiovisual, with the audiovisual track introducing a new sub-task of onscreen/offscreen event classification necessitated by the limited FOV. This challenge introduces the DCASE2025 Task 3 Stereo SELD Dataset, whose stereo audio and perspective video clips are sampled and converted from the STARSS23 recordings. The baseline system is designed to process stereo audio and corresponding video frames as inputs. In addition to the typical SELD event classification and localization, it integrates onscreen/offscreen classification for the audiovisual track. The evaluation metrics have been modified to introduce an onscreen/offscreen accuracy metric, which assesses a model's ability to identify which sound sources are onscreen. In the experimental evaluation, the baseline system performs reasonably well on the stereo audio data.
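To make the data pipeline concrete, the sketch below illustrates three ideas the abstract describes: deriving a stereo pair from FOA audio, folding azimuths onto the left-right axis to reflect stereo's front-back ambiguity, and scoring onscreen/offscreen labels. This is a minimal sketch, not the challenge's official conversion or evaluation code: the ACN channel ordering, the mid-side downmix with 0.5 gain, and the function names (foa_to_stereo, fold_azimuth, onscreen_accuracy) are all assumptions made for illustration.

```python
import numpy as np


def foa_to_stereo(foa: np.ndarray) -> np.ndarray:
    """Downmix FOA audio (channels x samples) to a stereo pair.

    Assumption: ACN channel order (W, Y, Z, X), where W is the omni
    component and Y the left-right dipole. A common mid-side style
    combination is L = 0.5 * (W + Y), R = 0.5 * (W - Y).
    """
    w, y = foa[0], foa[1]
    left = 0.5 * (w + y)
    right = 0.5 * (w - y)
    return np.stack([left, right])  # shape (2, samples)


def fold_azimuth(azimuth_deg: np.ndarray) -> np.ndarray:
    """Fold azimuths into [-90, 90] degrees.

    Stereo audio cannot disambiguate front from back, so a rear source
    is mirrored to its front counterpart, e.g. 150 deg -> 30 deg.
    """
    rad = np.deg2rad(azimuth_deg)
    return np.rad2deg(np.arcsin(np.sin(rad)))


def onscreen_accuracy(pred_onscreen, true_onscreen) -> float:
    """Simplified reading of the onscreen/offscreen accuracy metric:
    fraction of correct labels over detections already matched to
    reference events."""
    pred = np.asarray(pred_onscreen, dtype=bool)
    true = np.asarray(true_onscreen, dtype=bool)
    return float(np.mean(pred == true))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    foa = rng.standard_normal((4, 48000))   # 1 s of 4-channel FOA at 48 kHz
    print(foa_to_stereo(foa).shape)          # (2, 48000)
    print(fold_azimuth(np.array([150.0, -150.0, 30.0])))  # [ 30. -30.  30.]
    print(onscreen_accuracy([True, False, True], [True, True, True]))  # ~0.667
```

The arcsin(sin(·)) fold maps any azimuth smoothly onto the left-right axis, which matches the task's restriction of DOA estimation to azimuth under stereo's inherent front-back ambiguity.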
Similar Papers
Spatial and Semantic Embedding Integration for Stereo Sound Event Localization and Detection in Regular Videos
Audio and Speech Processing
Finds sounds and their direction in videos.
Integrating Spatial and Semantic Embeddings for Stereo Sound Event Localization in Videos
Audio and Speech Processing
Helps computers understand sounds and sights together.
A Two-Step Learning Framework for Enhancing Sound Event Localization and Detection
Sound
Finds where sounds come from in 3D.