A Fast and Lightweight Model for Causal Audio-Visual Speech Separation
By: Wendi Sang, Kai Li, Runxuan Yang, and more
Potential Business Impact:
Lets computers hear one voice in a noisy room.
Audio-visual speech separation (AVSS) aims to extract a target speech signal from a mixture by leveraging both auditory and visual (lip movement) cues. However, most existing AVSS methods have complex architectures and rely on future context, operating offline, which makes them unsuitable for real-time applications. Inspired by the pipeline of RTFSNet, we propose a novel streaming AVSS model, named Swift-Net, which strengthens the causal processing capabilities required for real-time applications. Swift-Net adopts a lightweight visual feature extraction module and an efficient fusion module for audio-visual integration. Additionally, Swift-Net employs Grouped SRUs to integrate historical information across different feature spaces, improving how efficiently that history is used. We further propose a causal transformation template that facilitates converting non-causal AVSS models into causal counterparts. Experiments on three standard benchmark datasets (LRS2, LRS3, and VoxCeleb2) demonstrate that, under causal conditions, Swift-Net achieves outstanding performance, highlighting the potential of this method for processing speech in complex environments.
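To make the Grouped SRU idea concrete, below is a minimal, hypothetical PyTorch sketch (not the authors' code): the feature dimension is split into groups, each group is processed by its own SRU-style recurrence so historical context is integrated per feature subspace, and the groups are then merged. All module and parameter names (MinimalSRU, GroupedSRU, proj, merge, the number of groups) are illustrative assumptions, not details taken from the paper.

import torch
import torch.nn as nn


class MinimalSRU(nn.Module):
    # Single-layer SRU-style recurrence, causal by construction:
    # each output frame depends only on the current and past inputs.
    def __init__(self, dim: int):
        super().__init__()
        # One projection produces the candidate, forget gate, and reset gate.
        self.proj = nn.Linear(dim, 3 * dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, dim)
        z, f, r = self.proj(x).chunk(3, dim=-1)
        f, r = torch.sigmoid(f), torch.sigmoid(r)
        c = torch.zeros_like(x[:, 0])          # internal state, (batch, dim)
        outputs = []
        for t in range(x.size(1)):             # strictly left-to-right: streaming-friendly
            c = f[:, t] * c + (1.0 - f[:, t]) * z[:, t]
            h = r[:, t] * torch.tanh(c) + (1.0 - r[:, t]) * x[:, t]  # highway connection
            outputs.append(h)
        return torch.stack(outputs, dim=1)


class GroupedSRU(nn.Module):
    # Splits the channel dimension into groups, runs one SRU per group,
    # then mixes the groups back together; a rough stand-in for the
    # Grouped SRU mechanism described in the abstract.
    def __init__(self, dim: int, groups: int = 4):
        super().__init__()
        assert dim % groups == 0
        self.groups = groups
        self.srus = nn.ModuleList(MinimalSRU(dim // groups) for _ in range(groups))
        self.merge = nn.Linear(dim, dim)       # mixes information across groups

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        chunks = x.chunk(self.groups, dim=-1)
        out = torch.cat([sru(c) for sru, c in zip(self.srus, chunks)], dim=-1)
        return self.merge(out)


if __name__ == "__main__":
    feats = torch.randn(2, 100, 64)            # (batch, frames, channels)
    block = GroupedSRU(dim=64, groups=4)
    print(block(feats).shape)                  # torch.Size([2, 100, 64])

Because the recurrence only ever looks backward in time, a block like this can run frame by frame at inference, which is the property a causal, streaming AVSS model needs.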
Similar Papers
From Coarse to Fine: Recursive Audio-Visual Semantic Enhancement for Speech Separation
Sound
Cleans up voices in noisy videos using lip reading.
Lightweight Wasserstein Audio-Visual Model for Unified Speech Enhancement and Separation
CV and Pattern Recognition
Cleans up noisy and overlapping voices.
Online Audio-Visual Autoregressive Speaker Extraction
Audio and Speech Processing
Helps computers hear one voice in noisy rooms.