Noise-Robust Sound Event Detection and Counting via Language-Queried Sound Separation
By: Yuanjian Chen, Yang Xiao, Han Yin and more
Potential Business Impact:
Helps computers hear sounds in noisy places.
Most sound event detection (SED) systems perform well on clean datasets but degrade significantly in noisy environments. Language-queried audio source separation (LASS) models show promise for robust SED by separating target events. However, existing methods require elaborate multi-stage training and lack explicit guidance for target events. To address these challenges, we introduce event appearance detection (EAD), a counting-based approach that counts event occurrences at both the clip and frame levels. Based on EAD, we propose a co-training-based multi-task learning framework for EAD and SED to enhance SED's performance in noisy environments. First, SED is co-trained with EAD so that it learns the same patterns as EAD. Then, a task-based constraint is designed to improve prediction consistency between SED and EAD. This framework provides more reliable clip-level predictions for LASS models and strengthens timestamp detection capability. Experiments on the DESED and WildDESED datasets show that the framework outperforms existing methods, with the advantage becoming more pronounced at higher noise levels.
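To make the counting idea concrete, the sketch below shows one plausible reading of clip-level and frame-level counts, together with a toy consistency check between SED and EAD outputs. It is an illustration only, not the authors' implementation: the function names (`count_events`, `frame_level_counts`, `consistency_penalty`), the onset-based definition of a clip-level count, and the absolute-difference penalty are all assumptions that do not come from the paper.

```python
from typing import List


def count_events(frame_activity: List[int]) -> int:
    """Assumed clip-level count: number of inactive-to-active transitions."""
    count = 0
    previous = 0
    for active in frame_activity:
        if active and not previous:
            count += 1  # a new occurrence starts at this frame
        previous = active
    return count


def frame_level_counts(activity_per_class: List[List[int]]) -> List[int]:
    """Assumed frame-level count: number of classes active in each frame."""
    return [sum(frame) for frame in zip(*activity_per_class)]


def consistency_penalty(sed_frames: List[int], ead_count: int) -> int:
    """Hypothetical constraint: gap between the count implied by SED
    frame predictions and the count predicted by EAD."""
    return abs(count_events(sed_frames) - ead_count)


# Binarized frame predictions for two hypothetical classes over 8 frames.
dog_bark = [0, 1, 1, 0, 0, 1, 0, 0]
speech = [1, 1, 0, 0, 0, 1, 1, 1]

clip_counts = [count_events(dog_bark), count_events(speech)]
per_frame = frame_level_counts([dog_bark, speech])
penalty = consistency_penalty(dog_bark, ead_count=3)
```

Under these assumptions, a penalty of zero means the SED timestamps and the EAD count agree on how many times the event occurs. A trainable version would need a differentiable surrogate in place of the hard transition count.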
Similar Papers
Sound Event Detection with Boundary-Aware Optimization and Inference
Audio and Speech Processing
Finds exact start and end of sounds.
A Two-Step Learning Framework for Enhancing Sound Event Localization and Detection
Sound
Finds where sounds come from in 3D.
Detect Any Sound: Open-Vocabulary Sound Event Detection with Multi-Modal Queries
Sound
Lets computers hear any sound, even new ones.