Index-MSR: A high-efficiency multimodal fusion framework for speech recognition
By: Jinming Chen, Lu Wang, Zheshu Song, and more
Potential Business Impact:
Helps computers understand speech better by reading text from videos.
Driven by large-scale datasets and LLM-based architectures, automatic speech recognition (ASR) systems have achieved remarkable improvements in accuracy. However, challenges persist for domain-specific terminology and short utterances lacking semantic coherence, where recognition performance often degrades significantly. In this work, we present Index-MSR, an efficient multimodal speech recognition framework. At its core is a novel Multimodal Fusion Decoder (MFD), which effectively incorporates text-related information from videos (e.g., subtitles and presentation slides) into speech recognition. This cross-modal integration not only enhances overall ASR accuracy but also yields substantial reductions in substitution errors. Extensive evaluations on both an in-house subtitle dataset and a public AVSR dataset demonstrate that Index-MSR achieves state-of-the-art accuracy, with substitution errors reduced by 20-50%. These results show that our approach efficiently exploits text-related cues from video to improve speech recognition accuracy, with strong potential for applications requiring strict audio-text synchronization, such as audio translation.
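The abstract does not spell out how the Multimodal Fusion Decoder combines the two modalities. As a rough illustration only, the PyTorch sketch below shows one plausible scheme: a decoder layer with separate cross-attention into audio features and into text embeddings extracted from video frames (e.g., OCR of subtitles or slides), merged by a learned gate. The module name, gating mechanism, and dimensions are assumptions for illustration, not the paper's actual MFD design.

```python
import torch
import torch.nn as nn


class MultimodalFusionDecoderLayer(nn.Module):
    """Hypothetical decoder layer that attends to both audio features and
    video-derived text features (e.g., OCR of subtitles/slides)."""

    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.audio_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.video_text_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Gate deciding how much of the video-text cue to mix in at each position.
        self.gate = nn.Sequential(nn.Linear(2 * d_model, d_model), nn.Sigmoid())
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))
        self.norms = nn.ModuleList([nn.LayerNorm(d_model) for _ in range(3)])

    def forward(self, tgt, audio_feats, video_text_feats):
        # Self-attention over the partial transcript (causal mask omitted for brevity).
        x = self.norms[0](tgt + self.self_attn(tgt, tgt, tgt, need_weights=False)[0])
        # Cross-attention into the acoustic encoder output.
        a = self.audio_attn(x, audio_feats, audio_feats, need_weights=False)[0]
        # Cross-attention into text cues extracted from video frames.
        v = self.video_text_attn(x, video_text_feats, video_text_feats, need_weights=False)[0]
        # Gated fusion: lean on video text only where it is helpful.
        g = self.gate(torch.cat([a, v], dim=-1))
        x = self.norms[1](x + a + g * v)
        return self.norms[2](x + self.ffn(x))


# Toy shapes: batch of 2, 10 target tokens, 50 audio frames, 20 video-text tokens.
layer = MultimodalFusionDecoderLayer()
out = layer(torch.randn(2, 10, 512), torch.randn(2, 50, 512), torch.randn(2, 20, 512))
print(out.shape)  # torch.Size([2, 10, 512])
```

The gated residual is one simple way to let the model fall back to audio-only decoding when the on-screen text is irrelevant or noisy; the actual framework may fuse modalities differently.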
Similar Papers
Index-ASR Technical Report
Sound
Makes voice assistants understand better, with fewer mistakes.
MMS-LLaMA: Efficient LLM-based Audio-Visual Speech Recognition with Minimal Multimodal Speech Tokens
CV and Pattern Recognition
Lets computers understand speech better, even with noise.
Scalable Frameworks for Real-World Audio-Visual Speech Recognition
Audio and Speech Processing
Helps computers understand speech even with noise.