Detect Any Sound: Open-Vocabulary Sound Event Detection with Multi-Modal Queries

Published: July 22, 2025 | arXiv ID: 2507.16343v1

By: Pengfei Cai, Yan Song, Qing Gu and more

Potential Business Impact:

Lets computers hear any sound, even new ones.

Business Areas:

Speech Recognition, Data and Analytics, Software

Most existing sound event detection (SED) algorithms operate under a closed-set assumption, restricting their detection capabilities to predefined classes. While recent efforts have explored language-driven zero-shot SED by exploiting audio-language models, their performance is still far from satisfactory due to the lack of fine-grained alignment and cross-modal feature fusion. In this work, we propose the Detect Any Sound Model (DASM), a query-based framework for open-vocabulary SED guided by multi-modal queries. DASM formulates SED as a frame-level retrieval task, where audio features are matched against query vectors derived from text or audio prompts. To support this formulation, DASM introduces a dual-stream decoder that explicitly decouples event recognition and temporal localization: a cross-modality event decoder performs query-feature fusion and determines the presence of sound events at the clip level, while a context network models temporal dependencies for frame-level localization. Additionally, an inference-time attention masking strategy is proposed to leverage semantic relations between base and novel classes, substantially enhancing generalization to novel classes. Experiments on the AudioSet Strong dataset demonstrate that DASM effectively balances localization accuracy with generalization to novel classes, outperforming CLAP-based methods in the open-vocabulary setting (+7.8 PSDS) and the baseline in the closed-set setting (+6.9 PSDS). Furthermore, in cross-dataset zero-shot evaluation on DESED, DASM achieves a PSDS1 score of 42.2, even exceeding the supervised CRNN baseline. The project page is available at https://cai525.github.io/Transformer4SED/demo_page/DASM/.
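
The frame-level retrieval formulation described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `frame_level_retrieval`, the cosine-similarity scoring, the `temperature` and `threshold` parameters, and the multiplicative clip-level gating are all assumptions chosen for illustration. DASM's actual event decoder and context network are learned modules and are not reproduced here.

```python
import numpy as np


def frame_level_retrieval(frame_feats, queries, clip_probs,
                          temperature=0.1, threshold=0.5):
    """Hypothetical sketch: match audio frames against query vectors.

    frame_feats: (T, D) array, one feature vector per audio frame.
    queries:     (Q, D) array, one vector per text or audio prompt.
    clip_probs:  (Q,) array, assumed clip-level presence probability
                 for each query (stands in for the event decoder output).
    Returns a (T, Q) boolean array: True where a query is judged active.
    """
    eps = 1e-8  # guards against division by zero for all-zero vectors
    f = frame_feats / (np.linalg.norm(frame_feats, axis=1, keepdims=True) + eps)
    q = queries / (np.linalg.norm(queries, axis=1, keepdims=True) + eps)

    # Cosine similarity between every frame and every query: shape (T, Q).
    sim = f @ q.T

    # Map similarities to per-frame probabilities (assumed sigmoid scoring).
    frame_probs = 1.0 / (1.0 + np.exp(-sim / temperature))

    # Gate frame-level scores by clip-level presence, mirroring the idea of
    # decoupling recognition (clip level) from localization (frame level).
    gated = frame_probs * clip_probs[None, :]
    return gated > threshold


# Toy usage: three frames, two queries, only the first query present in the clip.
frames = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
prompts = np.array([[1.0, 0.0], [0.0, 1.0]])
presence = np.array([1.0, 0.0])
active = frame_level_retrieval(frames, prompts, presence)
```

Because the queries are ordinary vectors, a novel class only needs a new prompt embedding at inference time, which is the property that makes a retrieval formulation suitable for open-vocabulary detection.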

FlexSED: Towards Open-Vocabulary Sound Event Detection

Audio and Speech Processing

Finds specific sounds from any description.

23 Sep 2025 1

90%

Noise-Robust Sound Event Detection and Counting via Language-Queried Sound Separation

Sound

Helps computers hear sounds in noisy places.

10 Aug 2025 1

88%

Integrating Spatial and Semantic Embeddings for Stereo Sound Event Localization in Videos

Audio and Speech Processing

Helps computers understand sounds and sights together.

8 Sep 2025 0

Country of Origin

🇨🇳 🇸🇬 China, Singapore

Page Count

10 pages
