Adaptive Audio-Visual Speech Recognition via Matryoshka-Based Multimodal LLMs
By: Umberto Cappellazzo, Minsu Kim, Stavros Petridis
Potential Business Impact:
Lets computers understand speech better, even in noisy conditions.
Audio-Visual Speech Recognition (AVSR) leverages both the audio and visual modalities to improve robustness in noisy environments. Recent advances in Large Language Models (LLMs) have shown strong performance in speech recognition, including AVSR. However, speech representations form long token sequences, which leads to high computational costs for LLMs. Prior methods compress the inputs before feeding them to the LLM, but aggressive compression often harms accuracy. To address this, we propose Llama-MTSK, the first Matryoshka-based Multimodal LLM for AVSR, which flexibly adapts audio-visual token allocation under varying compute constraints. Inspired by Matryoshka Representation Learning, our model encodes representations at multiple granularities within a single architecture, avoiding the need to train a separate model per compression level. For efficient fine-tuning, we introduce three LoRA-based strategies that combine global and scale-specific modules. Evaluations on major AVSR datasets show that Llama-MTSK matches or outperforms models trained independently at fixed compression levels.
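To make the idea concrete, here is a minimal PyTorch sketch of the Matryoshka-style mechanism the abstract describes: a single projector compresses fused audio-visual tokens at several rates, and a scale-specific LoRA update is selected per rate so one set of weights serves every compute budget. All names here (MatryoshkaAVProjector, ScaleLoRALinear, the pooling rates, dimensions) are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch (not the authors' code) of Matryoshka-style multi-rate
# audio-visual token compression with scale-specific LoRA adapters.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ScaleLoRALinear(nn.Module):
    """Frozen base linear layer plus one low-rank (LoRA) update per compression rate."""

    def __init__(self, d_in, d_out, rates, r=8, alpha=16.0):
        super().__init__()
        self.base = nn.Linear(d_in, d_out)
        self.base.weight.requires_grad_(False)  # base weights stay frozen
        self.base.bias.requires_grad_(False)
        self.scaling = alpha / r
        # One (A, B) pair per compression rate: the "scale-specific" modules.
        self.lora_A = nn.ModuleDict({str(k): nn.Linear(d_in, r, bias=False) for k in rates})
        self.lora_B = nn.ModuleDict({str(k): nn.Linear(r, d_out, bias=False) for k in rates})
        for k in rates:
            nn.init.zeros_(self.lora_B[str(k)].weight)  # updates start at zero

    def forward(self, x, rate):
        k = str(rate)
        return self.base(x) + self.scaling * self.lora_B[k](self.lora_A[k](x))


class MatryoshkaAVProjector(nn.Module):
    """Compress fused audio-visual features at several rates with shared weights."""

    def __init__(self, d_av, d_llm, rates=(1, 2, 4, 8)):
        super().__init__()
        self.rates = rates
        self.proj = ScaleLoRALinear(d_av, d_llm, rates)

    def compress(self, av_tokens, rate):
        # Average-pool along time to shorten the token sequence by `rate`.
        # av_tokens: (batch, time, d_av)
        x = av_tokens.transpose(1, 2)                                  # (batch, d_av, time)
        x = F.avg_pool1d(x, kernel_size=rate, stride=rate, ceil_mode=True)
        return x.transpose(1, 2)                                       # (batch, time/rate, d_av)

    def forward(self, av_tokens):
        # Training-time view: produce LLM-ready tokens at every granularity
        # (Matryoshka-style), so one model covers all compute budgets.
        return {rate: self.proj(self.compress(av_tokens, rate), rate) for rate in self.rates}


if __name__ == "__main__":
    projector = MatryoshkaAVProjector(d_av=512, d_llm=4096)
    fused_av = torch.randn(2, 200, 512)                # 2 utterances, 200 fused AV frames
    for rate, tokens in projector(fused_av).items():
        print(f"rate {rate}: {tuple(tokens.shape)}")   # e.g. rate 4 -> (2, 50, 4096)
```

Under these assumptions, training would sum the recognition loss over all rates so the shared weights support every granularity, while inference picks a single rate to match the available compute.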
Similar Papers
Omni-AVSR: Towards Unified Multimodal Speech Recognition with Large Language Models
Audio and Speech Processing
Lets one computer understand talking from sound and sight.
MMS-LLaMA: Efficient LLM-based Audio-Visual Speech Recognition with Minimal Multimodal Speech Tokens
CV and Pattern Recognition
Lets computers understand talking better, even with noise.
MoME: Mixture of Matryoshka Experts for Audio-Visual Speech Recognition
Audio and Speech Processing
Lets computers understand talking and faces better.