Score: 2

Online Audio-Visual Autoregressive Speaker Extraction

Published: June 2, 2025 | arXiv ID: 2506.01270v1

By: Zexu Pan, Wupeng Wang, Shengkui Zhao and more

BigTech Affiliations: Alibaba

Potential Business Impact:

Helps computers hear one voice in noisy rooms.

Business Areas:
Speech Recognition, Data and Analytics, Software

This paper proposes a novel online audio-visual speaker extraction model. In the streaming regime, most studies optimize only the audio network, leaving the visual frontend less explored. We first propose a lightweight visual frontend based on depth-wise separable convolution. We then propose a lightweight autoregressive acoustic encoder that serves as a second cue, actively exploring the information in the speech signal separated at past steps. Scenario-wise, we study for the first time how the algorithm performs when the focus of attention, i.e., the target speaker, changes. Experimental results on the LRS3 dataset show that our visual frontend performs comparably to the previous state-of-the-art on both the SkiM and ConvTasNet audio backbones with only 0.1 million network parameters and 2.1 MACs per second of processing. The autoregressive acoustic encoder provides an additional 0.9 dB gain in terms of SI-SNRi, and its momentum is robust against the change in attention.
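
For readers unfamiliar with the SI-SNRi metric quoted in the abstract, the following is a minimal sketch of how scale-invariant signal-to-noise ratio improvement is commonly computed. It is not the authors' evaluation code. The function names `si_snr` and `si_snr_improvement`, the `eps` constant, and the toy sine-wave signals are assumptions made for illustration only.

```python
import math


def si_snr(estimate, target, eps=1e-8):
    """Scale-invariant SNR in dB between two equal-length sample sequences."""
    n = len(target)
    # Remove the mean of each signal, as is conventional for this metric.
    est_mean = sum(estimate) / n
    tgt_mean = sum(target) / n
    est = [e - est_mean for e in estimate]
    tgt = [t - tgt_mean for t in target]

    # Project the estimate onto the target to get the "signal" component.
    dot = sum(e * t for e, t in zip(est, tgt))
    tgt_energy = sum(t * t for t in tgt) + eps
    scale = dot / tgt_energy
    projection = [scale * t for t in tgt]

    # Whatever is left after the projection is treated as noise.
    noise = [e - p for e, p in zip(est, projection)]
    signal_energy = sum(p * p for p in projection)
    noise_energy = sum(x * x for x in noise) + eps
    return 10.0 * math.log10(signal_energy / noise_energy + eps)


def si_snr_improvement(estimate, mixture, target):
    """SI-SNRi: SI-SNR of the extracted signal minus SI-SNR of the raw mixture."""
    return si_snr(estimate, target) - si_snr(mixture, target)


# Toy signals (hypothetical): two sine waves standing in for two speakers.
N = 200
target = [math.sin(2 * math.pi * 5 * k / N) for k in range(N)]
interferer = [math.sin(2 * math.pi * 13 * k / N) for k in range(N)]
mixture = [t + i for t, i in zip(target, interferer)]
# Pretend an extractor attenuated the interfering speaker by a factor of 10.
estimate = [t + 0.1 * i for t, i in zip(target, interferer)]

improvement_db = si_snr_improvement(estimate, mixture, target)
print(f"SI-SNRi: {improvement_db:.2f} dB")
```

In this toy setup the interferer is attenuated tenfold in amplitude, so the improvement comes out at about 20 dB. The 0.9 dB gain reported in the abstract is a difference measured on this same scale.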

Country of Origin
🇨🇳 China

Page Count
5 pages

Category
Electrical Engineering and Systems Science:
Audio and Speech Processing