A Robust framework for sound event localization and detection on real recordings

Published: December 16, 2025 | arXiv ID: 2512.22156v1

By: Jin Sob Kim, Hyun Joon Park, Wooseok Shin and more

Potential Business Impact:

Finds sounds and where they come from.

Business Areas:

Speech Recognition Data and Analytics, Software

This technical report describes the systems submitted to DCASE2022 Challenge Task 3: sound event localization and detection (SELD). The task is to detect occurrences of sound events, identify their class, and estimate their position. Our system uses a ResNet-based model within a proposed robust framework for SELD. To generalize to real-world sound scenes, the framework combines augmentation techniques, a pipeline that mixes datasets from real-world sound scenes and emulations, and test-time augmentation. Augmentation and the use of external sound sources let the model train on diverse samples, while maintaining the number of real recording samples in each batch preserves enough exposure to real-world context. In addition, we design a test-time augmentation and a clustering-based model ensemble method to aggregate confident predictions. Experimental results show that the model trained under the proposed framework outperforms the baseline methods and achieves competitive performance on real-world sound recordings.
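The abstract mentions maintaining the number of real recording samples in each batch while mixing in emulated data. The report's actual pipeline is not given here, so the following is only a minimal sketch of one way such a batch could be composed. The function name `mixed_batches`, the parameters `batch_size` and `n_real`, and the choice to sample real recordings with replacement are all assumptions made for illustration.

```python
import random


def mixed_batches(real, synthetic, batch_size, n_real, seed=0):
    """Yield batches holding exactly n_real real items plus synthetic items.

    Hypothetical helper: assumes the real set is much smaller than the
    synthetic one, so real items are drawn with replacement.
    """
    if not 0 < n_real < batch_size:
        raise ValueError("n_real must satisfy 0 < n_real < batch_size")
    if not real:
        raise ValueError("real must not be empty")

    rng = random.Random(seed)
    n_syn = batch_size - n_real

    # Shuffle the synthetic pool once, then walk through it in chunks.
    syn = list(synthetic)
    rng.shuffle(syn)

    for start in range(0, len(syn) - n_syn + 1, n_syn):
        syn_part = syn[start:start + n_syn]
        # Draw the real items with replacement so every batch gets n_real of them.
        real_part = rng.choices(real, k=n_real)
        batch = real_part + syn_part
        rng.shuffle(batch)  # avoid a fixed real-then-synthetic ordering
        yield batch


# Toy usage with labelled placeholders instead of audio clips.
real_clips = [("real", i) for i in range(5)]
synthetic_clips = [("syn", i) for i in range(20)]
batches = list(mixed_batches(real_clips, synthetic_clips, batch_size=8, n_real=3))
```

With these toy inputs every batch carries three real items regardless of how large the synthetic pool is, which is the property the abstract describes; a real training loop would replace the placeholders with audio features and labels.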

Resnet-conformer network with shared weights and attention mechanism for sound event localization, detection, and distance estimation

Sound

Helps computers pinpoint sounds in noisy places.

23 Jul 2025

92%

A Two-Step Learning Framework for Enhancing Sound Event Localization and Detection

Sound

Finds where sounds come from in 3D.

30 Jul 2025

91%

Spatial and Semantic Embedding Integration for Stereo Sound Event Localization and Detection in Regular Videos

Audio and Speech Processing

Finds sounds and their direction in videos.

7 Jul 2025

Page Count

4 pages
