Adaptive Audio-Visual Speech Recognition via Matryoshka-Based Multimodal LLMs

Published: March 9, 2025 | arXiv ID: 2503.06362v2

By: Umberto Cappellazzo, Minsu Kim, Stavros Petridis

Potential Business Impact:

Lets computers understand speech more accurately, even in noisy environments.

Business Areas:
Speech Recognition Data and Analytics, Software

Audio-Visual Speech Recognition (AVSR) leverages audio and visual modalities to improve robustness in noisy environments. Recent advances in Large Language Models (LLMs) show strong performance in speech recognition, including AVSR. However, speech representations are long, leading to high computational costs for LLMs. Prior methods compress inputs before feeding them to LLMs, but high compression often harms accuracy. To address this, we propose Llama-MTSK, the first Matryoshka-based Multimodal LLM for AVSR, which flexibly adapts audio-visual token allocation under varying compute constraints. Inspired by Matryoshka Representation Learning, our model encodes representations at multiple granularities within a single architecture, avoiding the need for separate models per compression level. For efficient fine-tuning, we introduce three LoRA-based strategies using global and scale-specific modules. Evaluations on major AVSR datasets show Llama-MTSK matches or outperforms models trained at fixed compression levels.
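To make the Matryoshka-style idea concrete, the sketch below illustrates multi-granularity audio-visual token compression with a single shared model: during training the same LLM sees every compression level, and at inference one level is chosen to fit the compute budget. This is not the authors' code; the compression rates, average pooling as the compression operator, and the Hugging Face-style LLM interface (inputs_embeds, labels, .loss, .generate) are all illustrative assumptions.

```python
# Minimal sketch of Matryoshka-style multi-scale token allocation for AVSR.
# Assumptions (not from the paper): average pooling as the compressor,
# rates {4, 8, 16}, and an llm object with a Hugging Face-style interface.
import torch
import torch.nn.functional as F

COMPRESSION_RATES = [4, 8, 16]  # hypothetical audio-visual compression rates


def compress_tokens(tokens: torch.Tensor, rate: int) -> torch.Tensor:
    """Average-pool a (batch, seq_len, dim) token sequence by `rate`."""
    x = tokens.transpose(1, 2)                                   # (batch, dim, seq_len)
    x = F.avg_pool1d(x, kernel_size=rate, stride=rate, ceil_mode=True)
    return x.transpose(1, 2)                                     # (batch, ceil(seq_len/rate), dim)


def matryoshka_training_step(llm, audio_tokens, video_tokens, labels):
    """One training step: a single shared model is supervised at every
    granularity, so no separate model per compression level is needed."""
    total_loss = 0.0
    for rate in COMPRESSION_RATES:
        av = torch.cat([compress_tokens(audio_tokens, rate),
                        compress_tokens(video_tokens, rate)], dim=1)
        # Assumes the llm accepts fused embeddings plus text labels
        # and returns an object exposing a .loss attribute.
        total_loss = total_loss + llm(inputs_embeds=av, labels=labels).loss
    return total_loss / len(COMPRESSION_RATES)


def transcribe(llm, audio_tokens, video_tokens, rate: int):
    """At inference, pick a single rate that matches the available compute."""
    av = torch.cat([compress_tokens(audio_tokens, rate),
                    compress_tokens(video_tokens, rate)], dim=1)
    return llm.generate(inputs_embeds=av)
```

In this reading, the paper's LoRA-based strategies would attach either one global adapter shared across all rates or additional scale-specific adapters to the frozen LLM; the sketch omits those details since the exact module layout is described only at a high level in the abstract.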

Page Count
9 pages

Category
Computer Science:
Computer Vision and Pattern Recognition