MEAN-RIR: Multi-Modal Environment-Aware Network for Robust Room Impulse Response Estimation

Published: September 5, 2025 | arXiv ID: 2509.05205v1

By: Jiajian Chen, Jiakang Chen, Hang Chen and more

Potential Business Impact:

Predicts how a room shapes sound by combining audio, image, and text cues about the space.

Business Areas:
Internet Radio, Media and Entertainment, Music and Audio

This paper presents a Multi-Modal Environment-Aware Network (MEAN-RIR), which uses an encoder-decoder framework to predict the room impulse response (RIR) from multi-level environmental information drawn from audio, visual, and textual sources. Specifically, reverberant speech capturing room acoustic properties serves as the primary input, which is combined with panoramic images and text descriptions as supplementary inputs. Each input is processed by its respective encoder, and the outputs are fed into cross-attention modules to enable effective interaction between different modalities. The MEAN-RIR decoder generates two distinct components: the first component captures the direct sound and early reflections, while the second produces masks that modulate learnable filtered noise to synthesize the late reverberation. These two components are mixed to reconstruct the final RIR. The results show that MEAN-RIR significantly improves RIR estimation, with notable gains in acoustic parameters.
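
The two-component reconstruction described in the abstract can be illustrated with a rough sketch. This is not the authors' implementation: the function name `synthesize_rir`, the hard switch at a `crossover` sample, the exponential masks, and the fixed Gaussian noise (standing in for the learnable filtered noise) are all assumptions made for illustration, since the abstract only says the two components are "mixed".

```python
import math
import random


def synthesize_rir(early, masks, noise, crossover):
    """Mix an early component with mask-modulated noise into one RIR.

    early:     list of n samples (direct sound and early reflections)
    masks:     per-band lists of n mask values (hypothetical decoder output)
    noise:     per-band lists of n noise samples (stand-in for filtered noise)
    crossover: sample index where the late part takes over (assumption)
    """
    n = len(early)
    # Late reverberation: scale each noise band sample-by-sample by its mask,
    # then sum across bands.
    late = [sum(m[i] * z[i] for m, z in zip(masks, noise)) for i in range(n)]
    # Assumed mixing rule: keep the early component before the crossover,
    # the synthesized late reverberation after it.
    return [early[i] if i < crossover else late[i] for i in range(n)]


random.seed(0)
n, crossover = 4000, 400

# Toy early component: a direct-sound impulse plus three reflections.
early = [0.0] * n
early[0] = 1.0
for idx, amp in [(120, 0.5), (250, -0.3), (330, 0.2)]:
    early[idx] = amp

# Toy masks: one exponential decay per band (hypothetical decay rates).
decay_rates = [0.001, 0.002, 0.003, 0.004]
noise = [[random.gauss(0.0, 1.0) for _ in range(n)] for _ in decay_rates]
masks = [[0.05 * math.exp(-r * t) for t in range(n)] for r in decay_rates]

rir = synthesize_rir(early, masks, noise, crossover)
```

In this toy setup the masks play the role of per-band decay envelopes, so the late tail loses energy over time while the sparse early part is preserved exactly up to the crossover.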

Country of Origin
🇨🇳 China

Page Count
7 pages

Category
Electrical Engineering and Systems Science:
Audio and Speech Processing