Zero-shot Context Biasing with Trie-based Decoding using Synthetic Multi-Pronunciation

Published: August 25, 2025 | arXiv ID: 2508.17796v1

By: Changsong Liu, Yizhou Peng, Eng Siong Chng

Potential Business Impact:

Helps computers understand rare words in speech.

Business Areas:
Speech Recognition, Data and Analytics, Software

Contextual automatic speech recognition (ASR) systems allow for recognizing out-of-vocabulary (OOV) words, such as named entities or rare words. However, recognizing such words remains challenging due to limited training data and ambiguous or inconsistent pronunciations. In this paper, we propose a synthesis-driven multi-pronunciation contextual biasing method that performs zero-shot contextual ASR on a pretrained Whisper model. Specifically, we leverage text-to-speech (TTS) systems to synthesize diverse speech samples containing each target rare word, and then use the pretrained Whisper model to extract multiple predicted pronunciation variants. These variant token sequences are compiled into a prefix trie, which assigns rewards to beam hypotheses in a shallow-fusion manner during beam-search decoding. Afterward, any recognized variant is mapped back to the original rare word in the final transcription. Evaluation results on the LibriSpeech dataset show that our method reduces biased word error rate (WER) by 42% on test-clean and 43% on test-other while leaving unbiased WER essentially unchanged.
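
The trie-based rewarding and the final mapping step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the token IDs, the example word, the function names (`build_trie`, `biased_score`, `map_back`), the per-token reward, and the rule that a reward is revoked when a partial match fails are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Sequence, Union


@dataclass
class TrieNode:
    children: Dict[int, "TrieNode"] = field(default_factory=dict)
    word: Optional[str] = None  # set on the last token of a variant


def build_trie(variants: Dict[str, List[Sequence[int]]]) -> TrieNode:
    """Compile every variant token sequence of every rare word into one trie."""
    root = TrieNode()
    for word, seqs in variants.items():
        for seq in seqs:
            node = root
            for tok in seq:
                node = node.children.setdefault(tok, TrieNode())
            node.word = word
    return root


def biased_score(tokens: Sequence[int], log_probs: Sequence[float],
                 root: TrieNode, reward: float = 1.0) -> float:
    """Model log-probability plus a reward for each token that extends a match.

    Assumed policy: the reward of a partial match is kept while the match is
    still alive, committed when a variant completes, and revoked on a mismatch.
    """
    score = sum(log_probs)
    node, pending = root, 0.0
    for tok in tokens:
        if tok not in node.children:
            node, pending = root, 0.0  # failed match: drop its reward
        if tok in node.children:
            node = node.children[tok]
            pending += reward
            if node.word is not None:  # full variant recognized
                score += pending
                node, pending = root, 0.0
    return score + pending


def map_back(tokens: Sequence[int], root: TrieNode) -> List[Union[int, str]]:
    """Replace each longest matched variant with its original rare word."""
    out: List[Union[int, str]] = []
    i = 0
    while i < len(tokens):
        node, j, match = root, i, None
        while j < len(tokens) and tokens[j] in node.children:
            node = node.children[tokens[j]]
            j += 1
            if node.word is not None:
                match = (node.word, j)
        if match is not None:
            out.append(match[0])
            i = match[1]
        else:
            out.append(tokens[i])
            i += 1
    return out


# Hypothetical rare word with two variant token sequences (made-up token IDs).
trie = build_trie({"Quixote": [[5, 6], [5, 7, 8]]})
print(biased_score([1, 5, 6, 2], [-1.0] * 4, trie, reward=0.5))
print(map_back([1, 5, 7, 8, 2], trie))
```

In a real decoder the reward would be applied incrementally to each beam hypothesis at every step rather than by rescoring whole sequences; the whole-sequence form here only keeps the sketch short.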

Country of Origin
🇸🇬 Singapore

Page Count
6 pages

Category
Computer Science:
Computation and Language