Online Audio-Visual Autoregressive Speaker Extraction
By: Zexu Pan, Wupeng Wang, Shengkui Zhao, and more
Potential Business Impact:
Helps computers hear one voice in noisy rooms.
This paper proposes a novel online audio-visual speaker extraction model. In the streaming regime, most prior studies optimize only the audio network, leaving the visual frontend under-explored. We first propose a lightweight visual frontend based on depth-wise separable convolution. We then propose a lightweight autoregressive acoustic encoder that serves as a second cue, actively exploiting the information in the speech separated at past steps. Scenario-wise, for the first time, we study how the algorithm performs when the focus of attention, i.e., the target speaker, changes. Experimental results on the LRS3 dataset show that our visual frontend performs comparably to the previous state of the art on both SkiM and ConvTasNet audio backbones with only 0.1 million network parameters and 2.1 MACs per second of processing. The autoregressive acoustic encoder provides an additional 0.9 dB gain in SI-SNRi, and its momentum is robust against a change in attention.
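The abstract names two building blocks: a depth-wise separable convolution for the visual frontend and an autoregressive acoustic cue drawn from previously separated speech. Below is a minimal sketch, not the authors' code, of what these two ideas typically look like; the layer sizes, module names, and the `extractor` callable are illustrative assumptions.

```python
import torch
import torch.nn as nn


class DepthwiseSeparableConv1d(nn.Module):
    """Depth-wise conv followed by a point-wise (1x1) conv.

    Roughly C*k + C*C_out parameters instead of C*C_out*k for a standard
    convolution, which is why this kind of block suits a lightweight
    visual frontend. All sizes here are assumptions for illustration.
    """

    def __init__(self, channels: int, out_channels: int, kernel_size: int = 3):
        super().__init__()
        self.depthwise = nn.Conv1d(
            channels, channels, kernel_size,
            padding=kernel_size // 2, groups=channels,  # one filter per input channel
        )
        self.pointwise = nn.Conv1d(channels, out_channels, kernel_size=1)
        self.act = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time) sequence of lip/visual embeddings
        return self.act(self.pointwise(self.depthwise(x)))


def autoregressive_step(extractor, mixture_chunk, visual_cue, past_output):
    """One streaming step (hypothetical interface): condition the extractor on
    the visual cue and on the speech it separated at the previous step,
    which acts as the second, autoregressive acoustic cue."""
    if past_output is None:
        past_output = torch.zeros_like(mixture_chunk)  # cold start: no past speech yet
    return extractor(mixture_chunk, visual_cue, past_output)
```

In a streaming loop, the output of `autoregressive_step` for chunk t would be fed back as `past_output` for chunk t+1, which is the mechanism the abstract credits with the additional SI-SNRi gain.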
Similar Papers
Visual-Aware Speech Recognition for Noisy Scenarios
Computation and Language
Helps computers hear speech in noisy places.
From Coarse to Fine: Recursive Audio-Visual Semantic Enhancement for Speech Separation
Sound
Cleans up voices in noisy videos using lip reading.
Diffusion-Based Unsupervised Audio-Visual Speech Separation in Noisy Environments with Noise Prior
Audio and Speech Processing
Cleans up noisy audio to hear voices better.