Generating Diverse Audio-Visual 360 Soundscapes for Sound Event Localization and Detection

Published: April 3, 2025 | arXiv ID: 2504.02988v1

By: Adrian S. Roman, Aiden Chang, Gerardo Meza, and more

Potential Business Impact:

Enables robots and other machines to both hear and see where sounds come from, by combining audio and visual input for sound localization.

Business Areas:
Visual Search, Internet Services

We present SELDVisualSynth, a tool for generating synthetic videos for audio-visual sound event localization and detection (SELD). Our approach incorporates real-world background images to improve the realism of synthetic audio-visual SELD data while also ensuring audio-visual spatial alignment. The tool creates 360° synthetic videos in which object movements match synthetic SELD audio data and its annotations. Experimental results demonstrate that a model trained with this data attains performance gains across multiple metrics, achieving superior localization recall (56.4 LR) and competitive localization error (21.9° LE). We open-source our data generation tool for maximal use by members of the SELD research community.
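Ensuring audio-visual spatial alignment in a 360° video means placing each visual object at the pixel position implied by the audio annotation's direction of arrival (azimuth/elevation). The exact conventions used by SELDVisualSynth are not given here, so the following is a minimal sketch assuming an equirectangular frame with azimuth 0° at the horizontal center, azimuth increasing to the left, and elevation increasing upward (the DCASE SELD annotation ranges of [-180°, 180°] and [-90°, 90°]); the function name and frame size are illustrative, not from the paper.

```python
def doa_to_equirect(azimuth_deg, elevation_deg, width=1920, height=960):
    """Map a sound event's direction of arrival (azimuth, elevation in
    degrees) to pixel coordinates on an equirectangular 360-degree frame.

    Assumed conventions (hypothetical, for illustration):
      - azimuth in [-180, 180], 0 at frame center, positive to the left
      - elevation in [-90, 90], 0 at frame center, positive upward
    """
    # Azimuth maps linearly to the horizontal axis and wraps around.
    x = int((0.5 - azimuth_deg / 360.0) * width) % width
    # Elevation maps linearly to the vertical axis; clamp to the frame.
    y = min(max(int((0.5 - elevation_deg / 180.0) * height), 0), height - 1)
    return x, y

# A source straight ahead at the horizon lands at the frame center.
print(doa_to_equirect(0, 0))    # (960, 480)
# A source directly overhead lands at the top row.
print(doa_to_equirect(0, 90))   # (960, 0)
```

With a mapping like this, the same per-frame annotations that supervise the audio model also drive where objects are rendered, which is what keeps the synthetic audio and video spatially consistent.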

Repos / Data Links

Page Count
3 pages

Category
Computer Science:
Sound