Whispering Context: Distilling Syntax and Semantics for Long Speech Transcripts
By: Duygu Altinok
Potential Business Impact:
Makes voice typing understand long talks better.
ASR systems often struggle to maintain syntactic and semantic accuracy in long audio transcripts, which hurts downstream tasks such as Named Entity Recognition (NER), capitalization, and punctuation. We propose a novel approach that enhances ASR by distilling contextual knowledge from LLaMA models into Whisper. Our method uses two strategies: (1) token-level distillation with optimal transport to align dimensions and sequence lengths, and (2) representation-loss minimization between sentence embeddings of Whisper and LLaMA, blending syntax and semantics. Evaluations on the Spoken Wikipedia dataset, a benchmark with long audio and rich entities, demonstrate significant improvements in Word Error Rate (WER) as well as NER, capitalization, and punctuation accuracy. By introducing novel NER metrics and exploring semantics-aware ASR, our work highlights the value of integrating linguistic context into transcription, laying a foundation for robust, context-aware ASR on long-form speech.
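To make the two strategies concrete, here is a minimal PyTorch sketch of what such a distillation loss could look like. This is an illustration under assumptions, not the paper's implementation: the projection layers, the entropic Sinkhorn routine, the hidden-state dimensions (768 for Whisper, 4096 for LLaMA), and all hyperparameters are hypothetical stand-ins for the components the abstract names.

```python
# Hypothetical sketch of the two distillation losses described in the abstract:
# (1) token-level optimal transport between Whisper and LLaMA hidden states,
# (2) a representation loss between their pooled sentence embeddings.
# Shapes, module names, and hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

def sinkhorn(cost, n_iters=50, eps=0.05):
    """Entropic-OT transport plan for a (T_s, T_t) cost matrix, uniform marginals."""
    K = torch.exp(-cost / eps)  # Gibbs kernel
    u = torch.full((cost.size(0),), 1.0 / cost.size(0), device=cost.device)
    v = torch.full((cost.size(1),), 1.0 / cost.size(1), device=cost.device)
    a, b = u.clone(), v.clone()
    for _ in range(n_iters):
        a = u / (K @ b)
        b = v / (K.T @ a)
    return a.unsqueeze(1) * K * b.unsqueeze(0)  # (T_s, T_t) transport plan

class DistillLoss(nn.Module):
    def __init__(self, whisper_dim=768, llama_dim=4096, shared_dim=768):
        super().__init__()
        # Linear projections align the mismatched hidden dimensions so the
        # OT cost matrix compares student and teacher tokens in one space.
        self.student_proj = nn.Linear(whisper_dim, shared_dim)
        self.teacher_proj = nn.Linear(llama_dim, shared_dim)

    def forward(self, whisper_tokens, llama_tokens):
        # whisper_tokens: (T_s, whisper_dim); llama_tokens: (T_t, llama_dim);
        # T_s and T_t may differ, which is why OT handles the sequence alignment.
        s = self.student_proj(whisper_tokens)
        t = self.teacher_proj(llama_tokens)
        # (1) Token-level OT loss: cosine cost, weighted by the transport plan.
        cost = 1 - F.normalize(s, dim=-1) @ F.normalize(t, dim=-1).T
        plan = sinkhorn(cost.detach())  # plan held fixed; cost stays differentiable
        ot_loss = (plan * cost).sum()
        # (2) Sentence-level loss: mean-pooled embeddings, cosine distance.
        sent_loss = 1 - F.cosine_similarity(s.mean(0, keepdim=True),
                                            t.mean(0, keepdim=True)).mean()
        return ot_loss + sent_loss
```

In this sketch, detaching the cost before the Sinkhorn iterations keeps the transport plan as a fixed soft alignment while gradients still flow through the cosine cost into Whisper, one common way to stabilize OT-based distillation.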
Similar Papers
Towards Robust Dysarthric Speech Recognition: LLM-Agent Post-ASR Correction Beyond WER
Audio and Speech Processing
Fixes computer speech errors for clearer understanding.
WhiSPA: Semantically and Psychologically Aligned Whisper with Self-Supervised Contrastive and Student-Teacher Learning
Audio and Speech Processing
Helps computers understand emotions in spoken words.
Speech-Aware Long Context Pruning and Integration for Contextualized Automatic Speech Recognition
Computation and Language
Listens better to long talks, even with noise.