Survey of End-to-End Multi-Speaker Automatic Speech Recognition for Monaural Audio
By: Xinlu He, Jacob Whitehill
Potential Business Impact:
Helps computers understand many people talking at once.
Monaural multi-speaker automatic speech recognition (ASR) remains challenging due to data scarcity and the intrinsic difficulty of recognizing and attributing words to individual speakers, particularly in overlapping speech. Recent advances have driven the shift from cascade systems to end-to-end (E2E) architectures, which reduce error propagation and better exploit the synergy between speech content and speaker identity. Despite rapid progress in E2E multi-speaker ASR, the field lacks a comprehensive review of recent developments. This survey provides a systematic taxonomy of E2E neural approaches for multi-speaker ASR, highlighting recent advances and comparing methods. Specifically, we (1) analyze the two architectural paradigms for pre-segmented audio (SIMO vs. SISO), with their distinct characteristics and trade-offs; (2) review recent architectural and algorithmic improvements built on these two paradigms; (3) cover extensions to long-form speech, including segmentation strategies and speaker-consistent hypothesis stitching; and (4) evaluate and compare methods across standard benchmarks. We conclude with a discussion of open challenges and future research directions toward building robust and scalable multi-speaker ASR.