Towards Spatial Audio Understanding via Question Answering
By: Parthasaarathy Sudarsanam, Archontis Politis
Potential Business Impact:
Lets computers understand where sounds come from.
In this paper, we introduce a novel framework for spatial audio understanding of first-order ambisonic (FOA) signals through a question answering (QA) paradigm, aiming to extend the scope of sound event localization and detection (SELD) towards spatial scene understanding and reasoning. First, we curate and release fine-grained spatio-temporal textual descriptions for the STARSS23 dataset using a rule-based approach, and further enhance linguistic diversity using large language model (LLM)-based rephrasing. We also introduce a QA dataset aligned with the STARSS23 scenes, covering various aspects such as event presence, localization, spatial, and temporal relationships. To increase language variety, we again leverage LLMs to generate multiple rephrasings per question. Finally, we develop a baseline spatial audio QA model that takes FOA signals and natural language questions as input and provides answers regarding various occurrences, temporal, and spatial relationships of sound events in the scene formulated as a classification task. Despite being trained solely with scene-level question answering supervision, our model achieves performance that is comparable to a fully supervised sound event localization and detection model trained with frame-level spatiotemporal annotations. The results highlight the potential of language-guided approaches for spatial audio understanding and open new directions for integrating linguistic supervision into spatial scene analysis.
Similar Papers
Multi-Domain Audio Question Answering Toward Acoustic Content Reasoning in The DCASE 2025 Challenge
Sound
Teaches computers to answer questions about sounds.
The World is Not Mono: Enabling Spatial Understanding in Large Audio-Language Models
Sound
Helps computers hear sounds from different directions.
Spatial Audio Motion Understanding and Reasoning
Sound
Lets computers hear where sounds are coming from.