Peeking Into The Future For Contextual Biasing
By: Ramaneswaran Selvakumar, Cindy Tseng, Eesung Kim, and more
Potential Business Impact:
Helps voice assistants understand names better.
While end-to-end (E2E) automatic speech recognition (ASR) models excel at general transcription, they struggle to recognize rare or unseen named entities (e.g., contact names, locations), which are critical for downstream applications like virtual assistants. In this paper, we propose a contextual biasing method for attention-based encoder-decoder (AED) models using a list of candidate named entities. Instead of predicting only the next token, we simultaneously predict multiple future tokens, enabling the model to "peek into the future" and score potential candidate entities in the entity list. Moreover, our approach leverages the multi-token prediction logits directly, without requiring additional entity encoders or cross-attention layers, significantly reducing architectural complexity. Experiments on LibriSpeech demonstrate that our approach achieves up to a 50.34% relative improvement in named entity word error rate compared to the baseline AED model.
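The core idea can be illustrated with a minimal sketch. This is not the paper's implementation; it assumes the decoder emits, at a single step, K sets of next-token log-probabilities (one per future position), and that candidate entities are scored by summing those log-probabilities over their first K tokens:

```python
import math

def score_entities(multi_token_logprobs, entities):
    """Illustrative sketch of multi-token "peek ahead" entity scoring.

    multi_token_logprobs: list of K dicts mapping token -> log-probability,
        one dict per future position predicted at the current decoder step.
    entities: list of candidate entity strings (space-separated tokens),
        standing in for the biasing list of names/locations.
    Returns a dict mapping each entity to its peek-ahead score.
    """
    floor = math.log(1e-9)  # assumed floor for tokens absent from the logits
    scores = {}
    for entity in entities:
        tokens = entity.split()
        # Sum log-probs over the first K tokens of the entity, aligning
        # token position k with the k-th future-token prediction.
        score = 0.0
        for k, tok in enumerate(tokens[: len(multi_token_logprobs)]):
            score += multi_token_logprobs[k].get(tok, floor)
        scores[entity] = score
    return scores

# Toy usage: two future positions, two candidate contact names.
logprobs = [
    {"john": math.log(0.6), "mary": math.log(0.3)},
    {"smith": math.log(0.5), "doe": math.log(0.2)},
]
ranked = score_entities(logprobs, ["john smith", "mary doe"])
```

In this toy example, "john smith" receives a higher score than "mary doe" because its tokens dominate both future-position distributions, so it would be the preferred biasing candidate at this step.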
Similar Papers
A Neural Model for Contextual Biasing Score Learning and Filtering
Audio and Speech Processing
Helps voice assistants understand you better.
Improving Named Entity Transcription with Contextual LLM-based Revision
Computation and Language
Fixes computer speech errors for important names.
CMT-LLM: Contextual Multi-Talker ASR Utilizing Large Language Models
Audio and Speech Processing
Helps computers understand many people talking at once.