A Two-Step Learning Framework for Enhancing Sound Event Localization and Detection
By: Hogeon Yu
Potential Business Impact:
Finds where sounds come from in 3D.
Sound Event Localization and Detection (SELD) is crucial in spatial audio processing, enabling systems to detect sound events and estimate their 3D directions of arrival (DoA). Existing SELD methods use single- or dual-branch architectures: single-branch models share SED and DoA representations, causing optimization conflicts, while dual-branch models separate the tasks but limit information exchange. To address this, we propose a two-step learning framework. First, we introduce a track-wise reordering format that maintains temporal consistency, preventing events from being reassigned across output tracks. Next, we train the SED and DoA networks separately to prevent interference and ensure task-specific feature learning. Finally, we fuse the DoA and SED features to enhance SELD performance through better spatial and event representations. Experiments on the DCASE 2023 Challenge Task 3 dataset validate our framework, showing that it overcomes the limitations of single- and dual-branch designs and improves both event classification and localization.
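The abstract names three ingredients: a track-wise output format, separately trained SED and DoA branches, and a fusion stage. The snippet below is a minimal PyTorch sketch of how such a two-step model could be wired; the GRU backbone, layer widths, the 13-class/3-track configuration (typical of DCASE Task 3), and concatenation-based fusion are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class SELDTwoStep(nn.Module):
    """Illustrative two-step SELD model: task-specific SED and DoA branches
    (trained first, independently) followed by a fusion stage that conditions
    DoA estimation on both feature streams. All sizes are assumptions."""

    def __init__(self, n_feats=64, n_classes=13, n_tracks=3, hidden=128):
        super().__init__()
        # Step 1: separate branches so SED and DoA gradients do not interfere.
        self.sed_branch = nn.GRU(n_feats, hidden, batch_first=True)
        self.doa_branch = nn.GRU(n_feats, hidden, batch_first=True)
        self.sed_head = nn.Linear(hidden, n_tracks * n_classes)
        # Step 2: fuse branch features, then predict per-track Cartesian DoA.
        self.fusion = nn.Linear(2 * hidden, hidden)
        self.doa_head = nn.Linear(hidden, n_tracks * 3)

    def forward(self, x):
        # x: (batch, time, n_feats) audio features, e.g. log-mels + intensity
        sed_feat, _ = self.sed_branch(x)
        doa_feat, _ = self.doa_branch(x)
        sed_out = torch.sigmoid(self.sed_head(sed_feat))  # per-track activity
        fused = torch.tanh(self.fusion(torch.cat([sed_feat, doa_feat], dim=-1)))
        doa_out = torch.tanh(self.doa_head(fused))        # per-track DoA vectors
        return sed_out, doa_out

model = SELDTwoStep()
sed, doa = model(torch.randn(8, 100, 64))  # -> (8, 100, 39), (8, 100, 9)
```

The track-wise reordering format is described only at a high level here; one plausible reading is a frame-to-frame matching that keeps an ongoing event in the same output track. The hypothetical helper below sketches that idea with Hungarian matching; the paper's actual matching criterion may differ.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def trackwise_reorder(prev: np.ndarray, curr: np.ndarray) -> np.ndarray:
    """Permute the current frame's per-track predictions so each track best
    continues the previous frame's assignment, avoiding event reassignment
    across tracks. prev, curr: (n_tracks, n_classes) activity vectors."""
    cost = -prev @ curr.T                 # negative similarity as matching cost
    _, col = linear_sum_assignment(cost)  # optimal prev->curr track matching
    return curr[col]
```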
Similar Papers
An Experimental Study on Joint Modeling for Sound Event Localization and Detection with Source Distance Estimation
Sound
Pinpoints sound location in 3D space.
A Robust Framework for Sound Event Localization and Detection on Real Recordings
Sound
Finds sounds and where they come from.
Integrating Spatial and Semantic Embeddings for Stereo Sound Event Localization in Videos
Audio and Speech Processing
Helps computers understand sounds and sights together.