Generating Diverse Audio-Visual 360 Soundscapes for Sound Event Localization and Detection
By: Adrian S. Roman, Aiden Chang, Gerardo Meza, and more
Potential Business Impact:
Makes robots hear and see where sounds come from.
We present SELDVisualSynth, a tool for generating synthetic videos for audio-visual sound event localization and detection (SELD). Our approach incorporates real-world background images to improve the realism of synthetic audio-visual SELD data while also ensuring audio-visual spatial alignment. The tool creates 360° synthetic videos in which object motion matches the synthetic SELD audio data and its annotations. Experimental results demonstrate that a model trained with this data attains performance gains across multiple metrics, achieving superior localization recall (56.4 LR) and competitive localization error (21.9° LE). We open-source our data generation tool for maximal use by members of the SELD research community.
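The key to audio-visual spatial alignment is mapping each audio annotation's direction of arrival onto the equirectangular 360° video frame so that the rendered object appears where the sound comes from. The sketch below illustrates one plausible mapping, assuming annotations follow the DCASE SELD convention (azimuth in [-180°, 180°], elevation in [-90°, 90°]); the pixel-coordinate convention and the `paste_object_sprite` compositor are illustrative assumptions, not the tool's actual API.

```python
import numpy as np

def azel_to_equirect_pixels(azimuth_deg, elevation_deg, width=1920, height=960):
    """Map a SELD azimuth/elevation annotation (degrees) to pixel coordinates
    on an equirectangular 360-degree frame.

    Assumes the DCASE SELD convention: azimuth in [-180, 180] (left positive),
    elevation in [-90, 90] (up positive), with azimuth 0 at the frame center.
    """
    # Horizontal: azimuth 0 -> frame center; positive azimuth moves toward the left edge.
    x = (0.5 - azimuth_deg / 360.0) * width
    # Vertical: elevation +90 -> top of frame, -90 -> bottom.
    y = (0.5 - elevation_deg / 180.0) * height
    return int(round(x)) % width, int(np.clip(round(y), 0, height - 1))


# Example: place an object sprite for each annotated frame of one sound event.
# Rows follow a (frame, class, source, azimuth, elevation) layout typical of
# synthetic SELD metadata; the compositing call is a hypothetical stand-in.
annotations = [
    (0, 3, 0, 30.0, 10.0),   # event front-left, slightly above the horizon
    (1, 3, 0, 32.0, 10.0),
]
for frame_idx, cls, src, az, el in annotations:
    px, py = azel_to_equirect_pixels(az, el)
    # paste_object_sprite(video[frame_idx], cls, center=(px, py))  # hypothetical compositor
    print(f"frame {frame_idx}: class {cls} at pixel ({px}, {py})")
```

A mapping along these lines keeps the visual object's screen position consistent with the spatial audio labels, which is what lets the synthetic videos serve as spatially aligned training data for audio-visual SELD models.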
Similar Papers
Integrating Spatial and Semantic Embeddings for Stereo Sound Event Localization in Videos
Audio and Speech Processing
Helps computers understand sounds and sights together.
Sonic4D: Spatial Audio Generation for Immersive 4D Scene Exploration
Sound
Makes videos sound like you're really there.
Cross-Modal Knowledge Distillation with Multi-Level Data Augmentation for Low-Resource Audio-Visual Sound Event Localization and Detection
Sound
Helps computers find sounds in videos better.