Dynamic Multi-Expert Projectors with Stabilized Routing for Multilingual Speech Recognition
By: Isha Pandey, Ashish Mittal, Vartul Bahuguna, and more
Potential Business Impact:
Helps computers understand many spoken languages.
Recent advances in LLM-based ASR connect frozen speech encoders with Large Language Models (LLMs) via lightweight projectors. While effective in monolingual settings, a single projector struggles to capture the diverse acoustic-to-semantic mappings required for multilingual ASR. To address this, we propose SMEAR-MoE, a stabilized Mixture-of-Experts projector that ensures dense gradient flow to all experts, preventing expert collapse while enabling cross-lingual sharing. We systematically compare monolithic, static multi-projector, and dynamic MoE designs across four Indic languages (Hindi, Marathi, Tamil, Telugu). Our SMEAR-MoE achieves strong performance, delivering up to a 7.6% relative WER reduction over the single-projector baseline, while maintaining comparable runtime efficiency. Analysis of expert routing further shows linguistically meaningful specialization, with related languages sharing experts. These results demonstrate that stable multi-expert projectors are key to scalable and robust multilingual ASR.
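To make the idea of "dense gradient flow to all experts" concrete, here is a minimal, hypothetical PyTorch sketch of a SMEAR-style MoE projector. It is not the paper's implementation; the class name SMEARProjector, the mean-pooled utterance-level router, and the dimensions are assumptions for illustration. The key point it demonstrates is soft merging: the router's softmax weights blend all expert projector parameters, so every expert receives gradients on every step instead of being selected by hard top-k routing.

```python
import torch
import torch.nn as nn

class SMEARProjector(nn.Module):
    """Hypothetical sketch of a SMEAR-style MoE projector between a frozen
    speech encoder and an LLM. Expert weights are softly merged by the router,
    so all experts get dense gradients (no hard routing, no expert collapse)."""

    def __init__(self, enc_dim: int, llm_dim: int, num_experts: int = 4):
        super().__init__()
        # One linear projector per expert, stored as stacked parameters.
        self.expert_weight = nn.Parameter(torch.randn(num_experts, llm_dim, enc_dim) * 0.02)
        self.expert_bias = nn.Parameter(torch.zeros(num_experts, llm_dim))
        # Router scores each utterance (mean-pooled encoder features) over experts.
        self.router = nn.Linear(enc_dim, num_experts)

    def forward(self, speech_feats: torch.Tensor) -> torch.Tensor:
        # speech_feats: (batch, time, enc_dim) from the frozen speech encoder.
        pooled = speech_feats.mean(dim=1)                       # (batch, enc_dim)
        gates = torch.softmax(self.router(pooled), dim=-1)      # (batch, num_experts)
        # Softly merge expert parameters per utterance: all experts contribute,
        # so gradients flow to every expert through the gating weights.
        merged_w = torch.einsum("be,eoi->boi", gates, self.expert_weight)  # (batch, llm_dim, enc_dim)
        merged_b = torch.einsum("be,eo->bo", gates, self.expert_bias)      # (batch, llm_dim)
        # Apply the merged projector to every frame of the utterance.
        out = torch.einsum("bti,boi->bto", speech_feats, merged_w) + merged_b.unsqueeze(1)
        return out  # (batch, time, llm_dim), fed as soft prompts/features to the LLM


# Usage sketch: project frozen-encoder features for a batch of 2 utterances.
proj = SMEARProjector(enc_dim=1024, llm_dim=4096, num_experts=4)
feats = torch.randn(2, 50, 1024)
llm_inputs = proj(feats)  # shape: (2, 50, 4096)
```

Because the merge happens in parameter space rather than by dispatching tokens to a single expert, related languages can share experts through similar gating distributions, which is consistent with the routing analysis described in the abstract.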
Similar Papers
Scaling and Enhancing LLM-based AVSR: A Sparse Mixture of Projectors Approach
Audio and Speech Processing
Helps computers understand speech even in loud places.
Breaking the MoE LLM Trilemma: Dynamic Expert Clustering with Structured Compression
Computation and Language
Makes AI smarter, faster, and more memory-efficient.
Understanding Multilingualism in Mixture-of-Experts LLMs: Routing Mechanism, Expert Specialization, and Layerwise Steering
Computation and Language
Helps AI language models handle multiple languages better.