A Two-Step Learning Framework for Enhancing Sound Event Localization and Detection
By: Hogeon Yu
Potential Business Impact:
Finds where sounds come from in 3D.
Sound Event Localization and Detection (SELD) is crucial in spatial audio processing, enabling systems to detect sound events and estimate their 3D directions of arrival (DoA). Existing SELD methods use single- or dual-branch architectures: single-branch models share SED and DoA representations, causing optimization conflicts, while dual-branch models separate the tasks but limit information exchange. To address this, we propose a two-step learning framework. First, we introduce a track-wise reordering format that maintains temporal consistency, preventing events from being reassigned across output tracks. Next, we train the SED and DoA networks separately to prevent interference and ensure task-specific feature learning. Finally, we fuse the DoA and SED features to enhance SELD performance through better spatial and event representations. Experiments on the DCASE 2023 Challenge Task 3 dataset validate our framework, showing that it overcomes the limitations of single- and dual-branch designs and improves both event classification and localization.
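The abstract names three ingredients: a track-wise output format, separately trained SED and DoA branches, and a fusion stage. The snippet below is a minimal PyTorch sketch of how such a two-step model could be wired; the GRU backbone, layer widths, the 13-class/3-track configuration (typical of DCASE Task 3), and concatenation-based fusion are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class SELDTwoStep(nn.Module):
    """Illustrative two-step SELD model: task-specific SED and DoA branches
    (trained first, independently) followed by a fusion stage that conditions
    DoA estimation on both feature streams. All sizes are assumptions."""

    def __init__(self, n_feats=64, n_classes=13, n_tracks=3, hidden=128):
        super().__init__()
        # Step 1: separate branches so SED and DoA gradients do not interfere.
        self.sed_branch = nn.GRU(n_feats, hidden, batch_first=True)
        self.doa_branch = nn.GRU(n_feats, hidden, batch_first=True)
        self.sed_head = nn.Linear(hidden, n_tracks * n_classes)
        # Step 2: fuse branch features, then predict per-track Cartesian DoA.
        self.fusion = nn.Linear(2 * hidden, hidden)
        self.doa_head = nn.Linear(hidden, n_tracks * 3)

    def forward(self, x):
        # x: (batch, time, n_feats) audio features, e.g. log-mels + intensity
        sed_feat, _ = self.sed_branch(x)
        doa_feat, _ = self.doa_branch(x)
        sed_out = torch.sigmoid(self.sed_head(sed_feat))  # per-track activity
        fused = torch.tanh(self.fusion(torch.cat([sed_feat, doa_feat], dim=-1)))
        doa_out = torch.tanh(self.doa_head(fused))        # per-track DoA vectors
        return sed_out, doa_out

model = SELDTwoStep()
sed, doa = model(torch.randn(8, 100, 64))  # -> (8, 100, 39), (8, 100, 9)
```

The track-wise reordering format is described only at a high level here; one plausible reading is a frame-to-frame matching that keeps an ongoing event in the same output track. The hypothetical helper below sketches that idea with Hungarian matching; the paper's actual matching criterion may differ.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def trackwise_reorder(prev: np.ndarray, curr: np.ndarray) -> np.ndarray:
    """Permute the current frame's per-track predictions so each track best
    continues the previous frame's assignment, avoiding event reassignment
    across tracks. prev, curr: (n_tracks, n_classes) activity vectors."""
    cost = -prev @ curr.T                 # negative similarity as matching cost
    _, col = linear_sum_assignment(cost)  # optimal prev->curr track matching
    return curr[col]
```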
Similar Papers
An Experimental Study on Joint Modeling for Sound Event Localization and Detection with Source Distance Estimation
Sound
Pinpoints sound location in 3D space.
A Robust Framework for Sound Event Localization and Detection on Real Recordings
Sound
Finds sounds and where they come from.
Integrating Spatial and Semantic Embeddings for Stereo Sound Event Localization in Videos
Audio and Speech Processing
Helps computers understand sounds and sights together.