AD-AVSR: Asymmetric Dual-stream Enhancement for Robust Audio-Visual Speech Recognition

Published: August 11, 2025 | arXiv ID: 2508.07608v1

By: Junxiao Xue, Xiaozhen Liu, Xuecheng Wu and more

Potential Business Impact:

Helps computers understand talking even with loud noise.

Audio-visual speech recognition (AVSR) combines audio and visual modalities to improve speech recognition, especially in noisy environments. However, most existing methods rely on unidirectional enhancement or symmetric fusion, which limits their ability to capture the heterogeneous and complementary correlations in audio-visual data, especially under asymmetric information conditions. To address these gaps, we introduce AD-AVSR, a new AVSR framework based on bidirectional modality enhancement. Specifically, we first introduce an audio dual-stream encoding strategy that enriches audio representations from multiple perspectives and intentionally establishes asymmetry to support subsequent cross-modal interactions. The enhancement process involves two key components: an Audio-aware Visual Refinement Module, which enhances visual representations under audio guidance, and a Cross-modal Noise Suppression Masking Module, which refines audio representations using visual cues. Together they form a closed-loop, bidirectional information flow. To further enhance correlation robustness, we adopt a threshold-based selection mechanism to filter out irrelevant or weakly correlated audio-visual pairs. Extensive experimental results on the LRS2 and LRS3 datasets indicate that AD-AVSR consistently surpasses state-of-the-art (SOTA) methods in both performance and noise robustness, highlighting the effectiveness of our model design.
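
The threshold-based selection step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say how correlation is measured or what threshold is used, so everything here is an assumption for illustration only: cosine similarity as the correlation score, the hypothetical function names `cosine` and `select_pairs`, the 0.5 threshold, and the toy two-dimensional features. It is not the authors' implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def select_pairs(audio_frames, visual_frames, threshold=0.5):
    """Return indices of time-aligned pairs whose similarity meets the threshold.

    Assumption: both modalities are already projected into a shared feature
    space and aligned frame by frame. The source does not specify this.
    """
    kept = []
    for i, (a, v) in enumerate(zip(audio_frames, visual_frames)):
        if cosine(a, v) >= threshold:
            kept.append(i)
    return kept


# Toy features: pair 1 is orthogonal (uncorrelated) and should be filtered out.
audio = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
visual = [[1.0, 0.1], [1.0, 0.0], [0.9, 1.1]]
kept = select_pairs(audio, visual, threshold=0.5)
print(kept)
```

In this toy setup only the pairs whose features point in similar directions survive; a real system would presumably apply such a filter to learned embeddings before the cross-modal enhancement modules use them.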

Purification Before Fusion: Toward Mask-Free Speech Enhancement for Robust Audio-Visual Speech Recognition

Audio and Speech Processing

Helps computers understand talking in loud places.

18 Jan 2026 0

92%

Omni-AVSR: Towards Unified Multimodal Speech Recognition with Large Language Models

Audio and Speech Processing

Lets one computer understand talking from sound and sight.

10 Nov 2025 2

91%

Scalable Frameworks for Real-World Audio-Visual Speech Recognition

Audio and Speech Processing

Helps computers understand speech even with noise.

16 Dec 2025 1

Country of Origin

🇨🇳 China

Page Count

11 pages
