Score: 2

MoE Adapter for Large Audio Language Models: Sparsity, Disentanglement, and Gradient-Conflict-Free

Published: January 6, 2026 | arXiv ID: 2601.02967v2

By: Yishu Lei, Shuwei He, Jing Hu, and more

BigTech Affiliations: Baidu

Potential Business Impact:

Lets computers understand diverse sounds, such as speech, music, and background noise.

Business Areas:
Audio Media and Entertainment, Music and Audio

Extending the input modality of Large Language Models (LLMs) to the audio domain is essential for achieving comprehensive multimodal perception. However, acoustic information is intrinsically heterogeneous, entangling attributes such as speech, music, and environmental context. Existing research typically relies on a dense, parameter-shared adapter to model these diverse patterns, which induces gradient conflict during optimization, as the parameter updates required for distinct attributes contradict each other. To address this limitation, we introduce the MoE-Adapter, a sparse Mixture-of-Experts (MoE) architecture designed to decouple acoustic information. Specifically, it employs a dynamic gating mechanism that routes audio tokens to specialized experts capturing complementary feature subspaces, while retaining shared experts for global context, thereby mitigating gradient conflicts and enabling fine-grained feature learning. Comprehensive experiments show that the MoE-Adapter achieves superior performance on both audio semantic and paralinguistic tasks, consistently outperforming dense linear baselines at comparable computational cost. Furthermore, we will release the related code and models to facilitate future research.
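The abstract describes the core mechanism: a dynamic gate routes each audio token to a few specialized experts, while shared experts process every token for global context. Below is a minimal PyTorch sketch of such a sparse adapter, assuming top-k routing and two-layer MLP experts; all names (`MoEAdapter`, `num_experts`, `top_k`) and design details beyond what the abstract states are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of a sparse MoE adapter with shared experts, assuming
# top-k token routing and MLP experts (details not specified in the abstract).
import torch
import torch.nn as nn
import torch.nn.functional as F


class MoEAdapter(nn.Module):
    """Routes each audio token to its top-k specialized experts and always
    applies shared experts for global context, summing both paths."""

    def __init__(self, d_audio, d_model, num_experts=8, num_shared=1, top_k=2):
        super().__init__()
        self.top_k = top_k
        # Specialized experts: small MLPs projecting audio features into
        # the LLM embedding space (hypothetical expert design).
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_audio, d_model), nn.GELU(),
                          nn.Linear(d_model, d_model))
            for _ in range(num_experts)
        ])
        # Shared experts applied densely to every token, per the abstract.
        self.shared = nn.ModuleList([
            nn.Sequential(nn.Linear(d_audio, d_model), nn.GELU(),
                          nn.Linear(d_model, d_model))
            for _ in range(num_shared)
        ])
        # Dynamic gate: scores each token against each specialized expert.
        self.gate = nn.Linear(d_audio, num_experts)

    def forward(self, x):                       # x: (batch, seq, d_audio)
        gate_logits = self.gate(x)              # (batch, seq, num_experts)
        weights, idx = gate_logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)    # renormalize over top-k

        out = sum(e(x) for e in self.shared)    # dense shared path
        # Sparse path: each token only activates its selected experts.
        for k in range(self.top_k):
            for e_id, expert in enumerate(self.experts):
                mask = (idx[..., k] == e_id)    # tokens routed to expert e_id
                if mask.any():
                    out[mask] = out[mask] + (
                        weights[..., k][mask].unsqueeze(-1) * expert(x[mask])
                    )
        return out


if __name__ == "__main__":
    adapter = MoEAdapter(d_audio=512, d_model=1024)
    tokens = torch.randn(2, 50, 512)            # dummy audio token features
    print(adapter(tokens).shape)                # torch.Size([2, 50, 1024])
```

Note the intuition behind the gradient-conflict claim: because each specialized expert only receives gradients from the tokens routed to it, updates driven by, say, music tokens land in different parameters than those driven by speech tokens, rather than contradicting each other inside one dense projection.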

Country of Origin
🇨🇳 China

Repos / Data Links

Page Count
13 pages

Category
Computer Science: Sound