Multi-channel multi-speaker transformer for speech recognition
By: Guo Yifan, Tian Yao, Suo Hongbin and more
Potential Business Impact:
Helps computers understand many people talking at once.
With the development of teleconferencing and in-vehicle voice assistants, far-field multi-speaker speech recognition has become a hot research topic. The recently proposed multi-channel transformer (MCT) demonstrates that a transformer can model far-field acoustic environments. However, MCT cannot encode high-dimensional acoustic features for each speaker from mixed input audio because of the interference between speakers. To address this limitation, we propose the multi-channel multi-speaker transformer (M2Former) for far-field multi-speaker automatic speech recognition (ASR). Experiments on the SMS-WSJ benchmark show that M2Former outperforms end-to-end systems based on the neural beamformer, MCT, dual-path RNN with transform-average-concatenate, and multi-channel deep clustering by 9.2%, 14.3%, 24.9%, and 52.2%, respectively, in terms of relative word error rate reduction.
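The figures above are relative, not absolute, word error rate (WER) reductions. As a minimal sketch of how such a figure is conventionally computed, assuming the standard definition (baseline WER minus new WER, divided by baseline WER): the function name `relative_wer_reduction` and the WER values below are hypothetical and are not taken from the paper.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Return the relative WER reduction in percent.

    Assumes the conventional definition:
        100 * (baseline_wer - new_wer) / baseline_wer
    """
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# Hypothetical WERs for illustration only (not results from the paper):
# a baseline at 20.0% WER and a new system at 15.0% WER.
reduction = relative_wer_reduction(20.0, 15.0)
print(f"{reduction:.1f}%")  # prints "25.0%"
```

Under this definition, the same absolute gain yields a larger relative reduction when the baseline WER is lower, so relative figures against different baselines are not directly comparable as absolute gains.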
Similar Papers
Data-independent Beamforming for End-to-end Multichannel Multi-speaker ASR
Sound
Makes microphones hear one person in noisy rooms.
Explainable Transformer-CNN Fusion for Noise-Robust Speech Emotion Recognition
Sound
Helps computers understand emotions even with background noise.
Study of Lightweight Transformer Architectures for Single-Channel Speech Enhancement
Audio and Speech Processing
Makes phone calls clearer with less power.