Leveraging Mamba with Full-Face Vision for Audio-Visual Speech Enhancement

Published: August 19, 2025 | arXiv ID: 2508.13624v1

By: Rong Chao, Wenze Ren, You-Jin Li, and more

Potential Business Impact:

Helps computers hear one voice in noisy crowds.

Business Areas:
Speech Recognition, Data and Analytics, Software

Recent Mamba-based models have shown promise in speech enhancement by efficiently modeling long-range temporal dependencies. However, models like Speech Enhancement Mamba (SEMamba) remain limited to single-speaker scenarios and struggle in complex multi-speaker environments such as the cocktail party problem. To overcome this, we introduce AVSEMamba, an audio-visual speech enhancement model that integrates full-face visual cues with a Mamba-based temporal backbone. By leveraging spatiotemporal visual information, AVSEMamba enables more accurate extraction of target speech in challenging conditions. Evaluated on the AVSEC-4 Challenge development and blind test sets, AVSEMamba outperforms other monaural baselines in speech intelligibility (STOI), perceptual quality (PESQ), and non-intrusive quality (UTMOS), and achieves 1st place on the monaural leaderboard.
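The abstract does not include implementation details, so the following is a minimal sketch of how full-face visual cues might be fused with a Mamba temporal backbone for target-speaker extraction. All module names, dimensions, and the concat-and-project fusion are illustrative assumptions rather than the authors' architecture, and it assumes the public mamba-ssm package (CUDA build) for the Mamba block.

```python
import torch
import torch.nn as nn
from mamba_ssm import Mamba  # assumption: the public mamba-ssm package

class AVFusionMamba(nn.Module):
    """Illustrative audio-visual fusion with a Mamba temporal backbone (not the paper's code)."""

    def __init__(self, audio_dim=257, video_dim=512, d_model=256, n_layers=4):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, d_model)  # noisy-speech features, e.g. STFT magnitudes
        self.video_proj = nn.Linear(video_dim, d_model)  # full-face embeddings from a visual encoder
        self.fuse = nn.Linear(2 * d_model, d_model)      # simple concat-and-project fusion
        self.backbone = nn.ModuleList([Mamba(d_model=d_model) for _ in range(n_layers)])
        self.mask_head = nn.Sequential(nn.Linear(d_model, audio_dim), nn.Sigmoid())

    def forward(self, audio_feats, video_feats):
        # audio_feats: (B, T, audio_dim); video_feats: (B, T, video_dim),
        # with the video stream upsampled to the audio frame rate beforehand.
        x = torch.cat([self.audio_proj(audio_feats), self.video_proj(video_feats)], dim=-1)
        x = self.fuse(x)
        for layer in self.backbone:
            x = x + layer(x)  # residual Mamba blocks capture long-range temporal structure
        mask = self.mask_head(x)   # per-time-frequency mask for the target speaker
        return mask * audio_feats  # masked (enhanced) features
```

Concatenation followed by a linear projection is only one plausible fusion strategy; cross-attention or FiLM-style conditioning on the visual stream would be equally reasonable stand-ins.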

Country of Origin
🇹🇼 Taiwan, Province of China

Page Count
2 pages

Category
Computer Science: Sound