Audio-Visual Speech Enhancement In Complex Scenarios With Separation And Dereverberation Joint Modeling
By: Jiarong Du, Zhan Jin, Peijun Yang, and more
Potential Business Impact:
Cleans up noisy speech using sight and sound.
Audio-visual speech enhancement (AVSE) uses visual auxiliary information to extract a target speaker's speech from mixed audio. Real-world scenarios often involve complex acoustic environments with various interfering sounds and reverberation. Most previous methods struggle under such conditions, resulting in poor perceptual quality of the extracted speech. In this paper, we propose an effective AVSE system that performs well in complex acoustic environments. Specifically, we design a "separation before dereverberation" pipeline that can be extended to other AVSE networks. The 4th COGMHEAR Audio-Visual Speech Enhancement Challenge (AVSEC) aims to explore new approaches to speech processing in complex multimodal environments. We validated our system in AVSEC-4: it achieved excellent results on the three objective metrics on the competition leaderboard and ultimately secured first place in the human subjective listening test.
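The "separation before dereverberation" ordering can be illustrated with a minimal sketch. This is not the paper's network: the two stage functions below are toy signal-processing placeholders (oracle noise subtraction, then regularized inverse filtering against a known room impulse response), used only to show the pipeline structure of removing additive interference first and the reverberant tail second.

```python
import numpy as np

def separate(mixture, interference_estimate):
    # Stage 1 (placeholder): remove additive interference first,
    # while the target's reverberant tail is still intact.
    return mixture - interference_estimate

def dereverberate(reverberant, rir, eps=1e-8):
    # Stage 2 (placeholder): regularized frequency-domain inverse
    # filtering against a known room impulse response (RIR).
    n = len(reverberant)
    H = np.fft.rfft(rir, n)
    X = np.fft.rfft(reverberant, n)
    return np.fft.irfft(X * np.conj(H) / (np.abs(H) ** 2 + eps), n)

def avse_pipeline(mixture, interference_estimate, rir):
    # "Separation before dereverberation": denoise, then dereverb.
    return dereverberate(separate(mixture, interference_estimate), rir)

# Toy check with oracle estimates: reverberant target plus noise.
rng = np.random.default_rng(0)
clean = rng.standard_normal(1024)
rir = np.array([1.0, 0.0, 0.5, 0.25])   # toy minimum-phase RIR
reverb = np.convolve(clean, rir)        # full linear convolution
noise = 0.3 * rng.standard_normal(len(reverb))
out = avse_pipeline(reverb + noise, noise, rir)[:len(clean)]
print(np.allclose(out, clean, atol=1e-6))  # → True
```

In a learned system the oracle noise estimate and known RIR would be replaced by network predictions, but the ordering keeps the separation stage from having to model noise and reverberation jointly.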
Similar Papers
Audio-Visual Speech Enhancement: Architectural Design and Deployment Strategies
Sound
Cleans up noisy phone calls using sound and faces.
AD-AVSR: Asymmetric Dual-stream Enhancement for Robust Audio-Visual Speech Recognition
Multimedia
Helps computers understand talking even with loud noise.
End-to-End Audio-Visual Learning for Cochlear Implant Sound Coding in Noisy Environments
Audio and Speech Processing
Helps deaf people hear better in noisy places.