Whispering Context: Distilling Syntax and Semantics for Long Speech Transcripts
By: Duygu Altinok
Potential Business Impact:
Makes voice typing understand long talks better.
ASR systems often struggle to maintain syntactic and semantic accuracy in long audio transcripts, which hurts downstream tasks such as Named Entity Recognition (NER), capitalization, and punctuation restoration. We propose a novel approach that enhances ASR by distilling contextual knowledge from LLaMA models into Whisper. Our method combines two strategies: (1) token-level distillation with optimal transport to align hidden dimensions and sequence lengths, and (2) representation loss minimization between sentence embeddings of Whisper and LLaMA, blending syntax and semantics. Evaluations on the Spoken Wikipedia dataset, a benchmark of long recordings rich in named entities, demonstrate significant improvements in Word Error Rate (WER), NER, capitalization, and punctuation. By introducing novel NER metrics and exploring semantics-aware ASR, our work highlights the value of integrating linguistic context into transcription, laying a foundation for robust, context-aware ASR on long-form speech.
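To make the two strategies concrete, here is a minimal PyTorch sketch of what such a distillation objective could look like. This is not the paper's implementation: the linear projection, the Sinkhorn-based optimal-transport solver, the mean-pooled sentence embeddings, the loss weighting, and all dimensions are illustrative assumptions based only on the abstract.

```python
# Illustrative sketch of the two distillation losses described above.
# NOT the paper's code; every name, dimension, and hyperparameter
# here is an assumption for demonstration purposes.
import math

import torch
import torch.nn.functional as F
from torch import nn


def sinkhorn_plan(cost: torch.Tensor, n_iters: int = 50, eps: float = 0.1) -> torch.Tensor:
    """Approximate an optimal-transport plan for a cost matrix using
    Sinkhorn iterations (entropic regularization), assuming uniform
    marginals over student and teacher token positions."""
    n, m = cost.shape
    log_K = -cost / eps                                  # log of the Gibbs kernel
    log_a = torch.full((n,), -math.log(n))               # uniform source marginal
    log_b = torch.full((m,), -math.log(m))               # uniform target marginal
    u = torch.zeros(n)
    v = torch.zeros(m)
    for _ in range(n_iters):                             # alternating scaling updates
        u = log_a - torch.logsumexp(log_K + v[None, :], dim=1)
        v = log_b - torch.logsumexp(log_K + u[:, None], dim=0)
    return torch.exp(log_K + u[:, None] + v[None, :])    # (n, m) soft token alignment


def token_ot_loss(student_h: torch.Tensor, teacher_h: torch.Tensor,
                  proj: nn.Module) -> torch.Tensor:
    """Token-level distillation: project student states into the teacher's
    hidden space, then pay the transport cost between the two sequences,
    which may differ in length because the tokenizers differ."""
    s = proj(student_h)                                  # (T_s, d_t)
    cost = torch.cdist(s, teacher_h)                     # (T_s, T_t) pairwise distances
    plan = sinkhorn_plan(cost.detach())                  # alignment computed off-graph
    return (plan * cost).sum()


def sentence_repr_loss(student_h: torch.Tensor, teacher_h: torch.Tensor,
                       proj: nn.Module) -> torch.Tensor:
    """Representation loss between mean-pooled sentence embeddings of the
    student (Whisper) and the teacher (LLaMA) hidden states."""
    s = proj(student_h).mean(dim=0)                      # pooled student embedding
    t = teacher_h.mean(dim=0)                            # pooled teacher embedding
    return 1.0 - F.cosine_similarity(s, t, dim=0)        # cosine distance


# Toy usage with random stand-ins for real hidden states.
d_student, d_teacher = 768, 4096                         # e.g. Whisper vs. LLaMA widths
proj = nn.Linear(d_student, d_teacher)                   # learned dimension aligner
student_h = torch.randn(48, d_student)                   # Whisper decoder states
teacher_h = torch.randn(61, d_teacher)                   # LLaMA hidden states
loss = token_ot_loss(student_h, teacher_h, proj) \
       + 0.5 * sentence_repr_loss(student_h, teacher_h, proj)  # 0.5 is an assumed weight
loss.backward()                                          # gradients reach the projection
```

A full training setup would presumably add these terms to Whisper's usual transcription loss; Sinkhorn iteration is shown here only as one common way to realize an optimal-transport alignment between sequences of different lengths, and the paper may use a different solver.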
Similar Papers
Speech-Aware Long Context Pruning and Integration for Contextualized Automatic Speech Recognition
Computation and Language
Listens better to long talks, even with noise.
Probing the Hidden Talent of ASR Foundation Models for L2 English Oral Assessment
Computation and Language
Helps computers judge how well people speak English.
Context-Aware Whisper for Arabic ASR Under Linguistic Varieties
Computation and Language
Helps computers understand different Arabic accents better.