Generating Diverse Audio-Visual 360 Soundscapes for Sound Event Localization and Detection

Published: April 3, 2025 | arXiv ID: 2504.02988v1

By: Adrian S. Roman, Aiden Chang, Gerardo Meza, and more

Potential Business Impact:

Enables robots and other machines to both hear and see where sounds come from, by combining audio and visual input for sound localization.

Business Areas:
Visual Search, Internet Services

We present SELDVisualSynth, a tool for generating synthetic videos for audio-visual sound event localization and detection (SELD). Our approach incorporates real-world background images to improve the realism of synthetic audio-visual SELD data while also ensuring audio-visual spatial alignment. The tool creates 360° synthetic videos in which object movements match synthetic SELD audio data and its annotations. Experimental results demonstrate that a model trained with this data attains performance gains across multiple metrics, achieving superior localization recall (56.4 LR) and competitive localization error (21.9° LE). We open-source our data generation tool for maximal use by members of the SELD research community.
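Ensuring audio-visual spatial alignment in a 360° video means placing each visual object at the pixel position implied by the audio annotation's direction of arrival (azimuth/elevation). The exact conventions used by SELDVisualSynth are not given here, so the following is a minimal sketch assuming an equirectangular frame with azimuth 0° at the horizontal center, azimuth increasing to the left, and elevation increasing upward (the DCASE SELD annotation ranges of [-180°, 180°] and [-90°, 90°]); the function name and frame size are illustrative, not from the paper.

```python
def doa_to_equirect(azimuth_deg, elevation_deg, width=1920, height=960):
    """Map a sound event's direction of arrival (azimuth, elevation in
    degrees) to pixel coordinates on an equirectangular 360-degree frame.

    Assumed conventions (hypothetical, for illustration):
      - azimuth in [-180, 180], 0 at frame center, positive to the left
      - elevation in [-90, 90], 0 at frame center, positive upward
    """
    # Azimuth maps linearly to the horizontal axis and wraps around.
    x = int((0.5 - azimuth_deg / 360.0) * width) % width
    # Elevation maps linearly to the vertical axis; clamp to the frame.
    y = min(max(int((0.5 - elevation_deg / 180.0) * height), 0), height - 1)
    return x, y

# A source straight ahead at the horizon lands at the frame center.
print(doa_to_equirect(0, 0))    # (960, 480)
# A source directly overhead lands at the top row.
print(doa_to_equirect(0, 90))   # (960, 0)
```

With a mapping like this, the same per-frame annotations that supervise the audio model also drive where objects are rendered, which is what keeps the synthetic audio and video spatially consistent.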

Repos / Data Links

Page Count
3 pages

Category
Computer Science:
Sound