Dynamic Adaptive Shared Experts with Grouped Multi-Head Attention Mixture of Experts
By: Cheng Li, Jiexiong Liu, Yixuan Chen, and more
Potential Business Impact:
Makes computers understand long stories better.
Transformer models based on the Mixture of Experts (MoE) architecture have made significant progress in long-sequence modeling, but existing models still fall short in computational efficiency and in capturing long-range dependencies, particularly in the dynamic adaptability of expert resource allocation. In this paper, we propose a Dynamic Adaptive Shared Expert and Grouped Multi-Head Attention hybrid model (DASG-MoE) that enhances long-sequence modeling by integrating three modules. First, we employ a Grouped Multi-Head Attention (GMHA) mechanism to reduce the computational complexity of long sequences: by processing sequence groups in parallel with local sliding-window attention and feature aggregation, it addresses long-range dependencies while improving generalization to local information. Second, we design a Dual-Scale Shared Expert Structure (DSSE), in which shallow experts use lightweight computation to respond quickly to low-dimensional features, while deep experts handle high-dimensional, complex semantics through pre-training transfer and post-training optimization, striking a dynamic balance between efficiency and accuracy. Third, we propose a hierarchical Adaptive Dynamic Routing (ADR) mechanism that dynamically selects expert levels based on feature complexity and task requirements, and optimizes resource allocation through a local expert activation strategy. Experiments on multiple long-sequence benchmark datasets demonstrate that DASG-MoE outperforms state-of-the-art models.
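To make the three ideas in the abstract concrete, here is a minimal PyTorch-style sketch of (a) grouped local attention over fixed-size sequence groups and (b) a shallow/deep expert pair with a per-token router. This is an illustration under assumptions, not the authors' implementation: the module names (GroupedLocalAttention, DualScaleExperts), the soft two-way routing, and all sizes are hypothetical choices made for clarity.

```python
# Sketch only: grouped local attention + dual-scale experts with a simple router.
# Names, sizes, and the soft (non-sparse) gate are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GroupedLocalAttention(nn.Module):
    """Splits the sequence into fixed-size groups and attends within each group,
    a simple stand-in for grouped multi-head attention with local windows."""

    def __init__(self, dim: int, num_heads: int, group_size: int):
        super().__init__()
        self.group_size = group_size
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, n, d = x.shape
        pad = (-n) % self.group_size                  # pad so length divides evenly
        x = F.pad(x, (0, 0, 0, pad))
        g = x.shape[1] // self.group_size
        x = x.reshape(b * g, self.group_size, d)      # each group attends locally
        out, _ = self.attn(x, x, x)
        out = out.reshape(b, g * self.group_size, d)
        return out[:, :n]                             # drop the padding


class DualScaleExperts(nn.Module):
    """A lightweight 'shallow' expert and a larger 'deep' expert; a router
    weights them per token based on the token's features."""

    def __init__(self, dim: int, deep_hidden: int):
        super().__init__()
        self.shallow = nn.Linear(dim, dim)            # fast, low-capacity path
        self.deep = nn.Sequential(                    # slower, high-capacity path
            nn.Linear(dim, deep_hidden), nn.GELU(), nn.Linear(deep_hidden, dim)
        )
        self.router = nn.Linear(dim, 2)               # two levels: shallow / deep

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = F.softmax(self.router(x), dim=-1)         # per-token routing weights
        # Soft mixture for clarity; a sparse top-1 gate would run only one expert.
        return w[..., 0:1] * self.shallow(x) + w[..., 1:2] * self.deep(x)


if __name__ == "__main__":
    x = torch.randn(2, 1000, 64)                      # (batch, seq_len, dim)
    x = GroupedLocalAttention(dim=64, num_heads=4, group_size=128)(x)
    x = DualScaleExperts(dim=64, deep_hidden=256)(x)
    print(x.shape)                                    # torch.Size([2, 1000, 64])
```

A sparse top-1 or top-k gate, feature aggregation across groups, and the paper's local expert activation strategy would sit on top of this skeleton; the sketch only shows the structural split between grouped local attention and the two expert scales.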
Similar Papers
MoE-DiffuSeq: Enhancing Long-Document Diffusion Models with Sparse Attention and Mixture of Experts
Computation and Language
Writes long stories and code much faster.
Mixture of Weight-shared Heterogeneous Group Attention Experts for Dynamic Token-wise KV Optimization
Computation and Language
Makes AI remember more without using more computer power.
Dynamic Reasoning Chains through Depth-Specialized Mixture-of-Experts in Transformer Architectures
Computation and Language
Computers solve problems faster and smarter.