Score: 2

Scaling Multi-Talker ASR with Speaker-Agnostic Activity Streams

Published: October 4, 2025 | arXiv ID: 2510.03630v1

By: Xiluo He, Alexander Polok, Jesús Villalba and more

BigTech Affiliations: Johns Hopkins University

Potential Business Impact:

Lets speech recognition systems transcribe many overlapping speakers at once without the cost growing with each added speaker.

Business Areas:

Speech Recognition, Data and Analytics, Software

An increasingly common training paradigm for multi-talker automatic speech recognition (ASR) is to use speaker activity signals to adapt single-speaker ASR models for overlapping speech. Although effective, these systems require running the ASR model once per speaker, resulting in inference costs that scale with the number of speakers and limiting their practicality. In this work, we propose a method that decouples the inference cost of activity-conditioned ASR systems from the number of speakers by converting speaker-specific activity outputs into two speaker-agnostic streams. A central challenge is that naïvely merging speaker activities into streams significantly degrades recognition, since pretrained ASR models assume contiguous, single-speaker inputs. To address this, we design new heuristics aimed at preserving conversational continuity and maintaining compatibility with existing systems. We show that our approach is compatible with Diarization-Conditioned Whisper (DiCoW) to greatly reduce runtimes on the AMI and ICSI meeting datasets while retaining competitive performance.
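To make the idea of speaker-agnostic streams concrete, the following is a minimal sketch of one way diarized segments might be packed onto two streams. It is not the authors' method: the abstract does not specify their heuristics, so the `Segment` type, the `assign_to_streams` function, and the continuity rule (keep a speaker on the stream they last occupied when it is free) are all hypothetical choices made for illustration.

```python
from typing import List, NamedTuple


class Segment(NamedTuple):
    speaker: str
    start: float  # seconds
    end: float    # seconds


def assign_to_streams(segments: List[Segment], num_streams: int = 2) -> List[List[Segment]]:
    """Greedily place diarized segments onto a fixed number of streams.

    Hypothetical heuristic (an assumption, not taken from the paper):
    1. Only consider streams whose last segment has already ended.
    2. Among those, prefer a stream whose last segment came from the same speaker.
    3. Otherwise pick the free stream that has been idle longest.
    4. If no stream is free, fall back to the one that frees up earliest.
    """
    streams: List[List[Segment]] = [[] for _ in range(num_streams)]
    for seg in sorted(segments, key=lambda s: (s.start, s.end)):
        free = [i for i, st in enumerate(streams) if not st or st[-1].end <= seg.start]
        if not free:
            # More simultaneous speakers than streams: overlap is unavoidable here.
            target = min(range(num_streams), key=lambda i: streams[i][-1].end)
        else:
            same = [i for i in free if streams[i] and streams[i][-1].speaker == seg.speaker]
            if same:
                target = same[0]
            else:
                target = min(
                    free,
                    key=lambda i: streams[i][-1].end if streams[i] else float("-inf"),
                )
        streams[target].append(seg)
    return streams


# Illustrative input: three speakers, at most two talking at once.
example = [
    Segment("A", 0.0, 5.0),
    Segment("B", 3.0, 8.0),
    Segment("A", 6.0, 10.0),
    Segment("C", 9.0, 12.0),
]
streams = assign_to_streams(example)
for idx, stream in enumerate(streams):
    print(idx, [(s.speaker, s.start, s.end) for s in stream])
```

With this input, three speakers fit on two streams, so an activity-conditioned ASR model would run twice instead of three times. The gap widens as the number of speakers grows, provided no more than two people talk at once.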