Sparsity-Controllable Dynamic Top-p MoE for Large Foundation Model Pre-training
By: Can Jin, Hongwu Peng, Mingcan Xiang, et al.
Sparse Mixture-of-Experts (MoE) architectures effectively scale model capacity by activating only a subset of experts for each input token. However, the standard Top-k routing strategy imposes a uniform sparsity pattern that ignores the varying difficulty of tokens. While Top-p routing offers a flexible alternative, existing implementations typically rely on a fixed global probability threshold, which results in uncontrolled computational costs and sensitivity to hyperparameter selection. In this paper, we propose DTop-p MoE, a sparsity-controllable dynamic Top-p routing mechanism. To resolve the challenge of optimizing a non-differentiable threshold, we use a Proportional-Integral (PI) controller that dynamically adjusts the probability threshold to align the running activated-expert sparsity with a specified target. Furthermore, we introduce a dynamic routing normalization mechanism that adapts layer-wise routing logits, allowing different layers to learn distinct expert-selection patterns while sharing a global probability threshold. Extensive experiments on Large Language Models and Diffusion Transformers demonstrate that DTop-p consistently outperforms both Top-k and fixed-threshold Top-p baselines. Our analysis confirms that DTop-p maintains precise control over the number of activated experts while adaptively allocating resources across different tokens and layers. Finally, DTop-p exhibits strong scaling properties with respect to expert granularity, expert capacity, model size, and dataset size, offering a robust framework for large-scale MoE pre-training.
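To make the routing idea concrete, the sketch below shows one plausible way a PI controller could steer a shared Top-p threshold toward a target number of activated experts per token. It is not the authors' implementation: the class name `DynamicTopPRouter`, the gains `kp` and `ki`, `p_base`, and `target_experts` are illustrative assumptions, and the paper's layer-wise dynamic routing normalization is omitted for brevity.

```python
# Minimal sketch of PI-controlled dynamic Top-p routing, assuming PyTorch.
# NOT the paper's code: hyperparameters and names here are illustrative only,
# and the layer-wise dynamic routing normalization described in the abstract
# is left out.
import torch
import torch.nn.functional as F


class DynamicTopPRouter(torch.nn.Module):
    def __init__(self, d_model, n_experts, target_experts=2.0,
                 kp=0.01, ki=0.001, p_base=0.5):
        super().__init__()
        self.gate = torch.nn.Linear(d_model, n_experts, bias=False)
        self.target = target_experts        # desired average #activated experts per token
        self.kp, self.ki = kp, ki           # PI gains (assumed values)
        self.p_base = p_base                # initial probability threshold
        self.register_buffer("threshold", torch.tensor(p_base))
        self.register_buffer("err_integral", torch.tensor(0.0))

    def forward(self, x):
        # x: (num_tokens, d_model) -> routing weights and boolean expert mask
        probs = F.softmax(self.gate(x), dim=-1)

        # Top-p selection: keep the smallest set of highest-probability experts
        # whose cumulative probability reaches the current threshold.
        sorted_probs, sorted_idx = probs.sort(dim=-1, descending=True)
        exclusive_cum = sorted_probs.cumsum(dim=-1) - sorted_probs
        keep_sorted = exclusive_cum < self.threshold   # top-1 expert is always kept
        mask = torch.zeros_like(probs).scatter(-1, sorted_idx,
                                               keep_sorted.float()).bool()

        # PI controller: nudge the shared threshold so the running number of
        # activated experts per token tracks the target.
        if self.training:
            with torch.no_grad():
                err = self.target - mask.float().sum(-1).mean()
                self.err_integral += err
                new_p = self.p_base + self.kp * err + self.ki * self.err_integral
                self.threshold.copy_(new_p.clamp(1e-3, 1.0 - 1e-3))

        # Renormalize the selected experts' probabilities into routing weights.
        weights = probs * mask
        weights = weights / weights.sum(dim=-1, keepdim=True).clamp_min(1e-9)
        return weights, mask


# Toy usage: route a batch of token embeddings through the sketch above.
router = DynamicTopPRouter(d_model=16, n_experts=8, target_experts=2.0)
weights, mask = router(torch.randn(32, 16))
print(mask.float().sum(-1).mean())   # average #experts per token, steered toward 2.0
```

Because the threshold is updated from a running error rather than learned by backpropagation, the non-differentiability of the expert-selection step is sidestepped, which is the role the abstract assigns to the PI controller.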