ASR-Guided Speaker-Role Diarization and Diarization-Guided ASR Decoding
By: Arindam Ghosh, Mark Fuhs, Bongjun Kim, and more
Potential Business Impact:
Identifies who is talking and what they say.
From an application standpoint, speaker-role diarization (RD), e.g., doctor vs. patient or host vs. guest, is often more useful than traditional speaker diarization (SD), which assigns generic labels like speaker-1, speaker-2, etc. In the context of joint automatic speech recognition (ASR) + SD (who spoke what?), recent end-to-end models employ an auxiliary SD transducer, synchronized with the ASR transducer, to predict a speaker for each word. In this paper, we extend this framework to RD with three key contributions: (1) we simplify training via forced alignment and a cross-entropy loss instead of the RNNT loss, (2) we show that word prediction and role prediction require different amounts of predictor context, motivating separate task-specific predictors rather than the shared predictor used in existing models, and (3) we propose a way to leverage RD posterior activity to influence ASR decoding and reduce deletion errors on small words.
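To make the third contribution more concrete, here is a minimal, hypothetical sketch of how RD posterior activity might bias transducer decoding against deleting short words. The function, threshold, and penalty value below are illustrative assumptions for a single greedy decoding step, not the paper's actual formulation.

```python
# Hypothetical sketch: penalize the blank symbol when the role-diarization
# (RD) head is confident that some role is actively speaking, so the ASR
# transducer is less likely to delete short words. All names and constants
# (BLANK_ID, ACTIVITY_THRESHOLD, BLANK_PENALTY) are assumptions.

import numpy as np

BLANK_ID = 0              # assumed index of the transducer blank token
ACTIVITY_THRESHOLD = 0.6  # RD posterior above this => "a role is speaking"
BLANK_PENALTY = 2.0       # log-space penalty applied to blank (assumed value)

def biased_greedy_step(asr_log_probs: np.ndarray, rd_activity: float) -> int:
    """Pick the next transducer output symbol for one frame.

    asr_log_probs: log-posterior over the output vocabulary (blank at BLANK_ID).
    rd_activity:   maximum RD role posterior for this frame, in [0, 1].
    """
    log_probs = asr_log_probs.copy()
    if rd_activity > ACTIVITY_THRESHOLD:
        # RD says someone is speaking: discourage emitting blank here.
        log_probs[BLANK_ID] -= BLANK_PENALTY
    return int(np.argmax(log_probs))

# Toy usage: blank narrowly wins on raw ASR scores, but high RD activity
# flips the decision toward emitting the (short) word token.
frame_log_probs = np.log(np.array([0.40, 0.35, 0.25]))  # [blank, "a", "to"]
print(biased_greedy_step(frame_log_probs, rd_activity=0.9))  # -> 1, not 0
```

In this reading, the RD posterior acts as a soft voice/role-activity gate on the blank symbol during decoding; the same idea could be applied inside beam search by rescoring blank hypotheses rather than in a greedy step.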
Similar Papers
SpeakerLM: End-to-End Versatile Speaker Diarization and Recognition with Multimodal Large Language Models
Sound
Identifies who spoke what in audio.
Joint ASR and Speaker Role Tagging with Serialized Output Training
Audio and Speech Processing
Lets computers know who is talking in a conversation.
Multi-Stage Speaker Diarization for Noisy Classrooms
Sound
Helps computers know who spoke in noisy classrooms.