
HierMoE: Accelerating MoE Training with Hierarchical Token Deduplication and Expert Swap

Published: August 13, 2025 | arXiv ID: 2508.09591v1

By: Wenxiang Lin, Xinglin Pan, Lin Zhang and more

Potential Business Impact:

Makes AI learn much faster and use less power.

The sparsely activated mixture-of-experts (MoE) transformer has become a common architecture for large language models (LLMs) because its sparsity lowers computational demands while allowing the model size to scale easily. In MoE models, each MoE layer must dynamically route tokens to particular experts for computation, and the activated experts may not be located on the same device or GPU as the token. This leads to substantial communication and load imbalance across GPUs, which limits the scalability of distributed systems within a GPU cluster. To this end, we introduce HierMoE to accelerate the training of MoE models with two topology-aware techniques: 1) token deduplication to reduce communication traffic, and 2) expert swap to balance the workloads among all GPUs. To make these two approaches more general, we build theoretical models aimed at finding the best token deduplication and expert swap strategy under different model configurations and hardware environments. We implement our prototype HierMoE system atop Megatron-LM and conduct experiments on a 32-GPU cluster with DeepSeek-V3 and Qwen3-30B-A3B models. Experimental results show that HierMoE achieves $1.55\times$ to $3.32\times$ faster communication and delivers $1.18\times$ to $1.27\times$ faster end-to-end training compared to the state-of-the-art MoE training systems Tutel-2DH, SmartMoE, and Megatron-LM.
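To make the two ideas in the abstract concrete, the following is a minimal toy sketch, not the HierMoE implementation. It assumes a made-up cluster of 4 GPUs with 2 GPUs per node and 8 experts, counts how many token copies a dispatch would send with and without deduplication, and then tries a single expert swap to lower the peak per-GPU load. All function names, the routing table, and the brute-force swap search are hypothetical illustrations. The sketch ignores where each token originates, and the paper's theoretical models for choosing the strategy are not reproduced here.

```python
def count_transfers(routing, expert_to_gpu, gpus_per_node):
    """Count token copies sent under three dispatch schemes.

    routing[t] lists the expert ids chosen for token t (hypothetical input).
    expert_to_gpu[e] is the GPU that hosts expert e.
    Returns (naive, gpu_dedup, node_dedup).
    """
    naive = gpu_dedup = node_dedup = 0
    for experts in routing:
        gpus = {expert_to_gpu[e] for e in experts}
        nodes = {g // gpus_per_node for g in gpus}
        naive += len(experts)     # one copy per (token, expert) pair
        gpu_dedup += len(gpus)    # one copy per distinct destination GPU
        node_dedup += len(nodes)  # one copy per distinct destination node
    return naive, gpu_dedup, node_dedup


def gpu_loads(routing, expert_to_gpu, num_gpus):
    """Number of (token, expert) computations assigned to each GPU."""
    loads = [0] * num_gpus
    for experts in routing:
        for e in experts:
            loads[expert_to_gpu[e]] += 1
    return loads


def best_single_swap(routing, expert_to_gpu, num_gpus):
    """Brute-force search for the one expert swap that minimizes peak load.

    This exhaustive search is an assumption made for illustration only.
    """
    best = list(expert_to_gpu)
    best_peak = max(gpu_loads(routing, best, num_gpus))
    n = len(expert_to_gpu)
    for a in range(n):
        for b in range(a + 1, n):
            if expert_to_gpu[a] == expert_to_gpu[b]:
                continue  # swapping experts on the same GPU changes nothing
            trial = list(expert_to_gpu)
            trial[a], trial[b] = trial[b], trial[a]
            peak = max(gpu_loads(routing, trial, num_gpus))
            if peak < best_peak:
                best, best_peak = trial, peak
    return best, best_peak


# Toy setup: 8 experts, 2 per GPU, 4 GPUs, 2 GPUs per node.
expert_to_gpu = [0, 0, 1, 1, 2, 2, 3, 3]
# Top-2 routing decisions for 6 tokens (made-up data).
routing = [[0, 1], [0, 2], [1, 4], [0, 1], [2, 3], [0, 6]]

print(count_transfers(routing, expert_to_gpu, gpus_per_node=2))  # (12, 9, 8)
print(gpu_loads(routing, expert_to_gpu, num_gpus=4))             # [7, 3, 1, 1]
placement, peak = best_single_swap(routing, expert_to_gpu, num_gpus=4)
print(peak)                                                      # 4
```

In this toy case, sending one copy per destination GPU cuts the copies from 12 to 9, and sending one copy per destination node cuts them to 8. A single swap lowers the peak GPU load from 7 to 4. A swap also changes which experts share a GPU or node, so it changes the deduplication counts too, which is one reason the two choices interact.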

Accelerating Mixture-of-Experts Training with Adaptive Expert Replication

Distributed, Parallel, and Cluster Computing

Makes AI learn much faster without wasting power.

28 Apr 2025 3

91%

DualSparse-MoE: Coordinating Tensor/Neuron-Level Sparsity with Expert Partition and Reconstruction

Machine Learning (CS)

Makes smart computer programs run faster and better.

25 Aug 2025 0

91%

Speculative MoE: Communication Efficient Parallel MoE Inference with Speculative Token and Expert Pre-scheduling

Machine Learning (CS)

Makes AI answer questions much faster.

6 Mar 2025 2


Country of Origin

🇨🇳 🇭🇰 China, Hong Kong

Repos / Data Links

github.com

Page Count

11 pages
