MicroMoE: Fine-Grained Load Balancing for Mixture-of-Experts with Token Scheduling
By: Chenqi Zhao, Wenfei Wu, Linhai Song, and more
Potential Business Impact:
Speeds up AI model training by balancing work evenly across GPUs.
Mixture-of-Experts (MoE) has emerged as a promising approach to scale up deep learning models because it significantly reduces computational cost. However, the dynamic nature of MoE routing leads to load imbalance among experts, severely impacting training efficiency. While previous research has attempted to address the load balancing challenge, existing solutions either compromise model accuracy or introduce additional system overhead. As a result, they fail to achieve fine-grained load balancing, which is crucial to optimizing training efficiency. We propose MicroEP, a novel parallelization strategy that achieves fine-grained load balancing in MoE systems: by efficiently scheduling tokens across GPUs, MicroEP attains optimal load balance in every micro-batch. Furthermore, we build MicroMoE, an efficient distributed MoE training system that incorporates MicroEP's load-balancing capability. Our experimental results demonstrate that MicroMoE improves end-to-end training throughput by up to 47.6% compared with the state-of-the-art system, and achieves optimal load balance among GPUs in almost all cases.
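The abstract does not spell out the scheduling algorithm itself, so the following is only a minimal sketch of the general idea of per-micro-batch token load balancing: given how many tokens were routed to each expert in one micro-batch, place the experts' token batches onto GPUs so that per-GPU token counts stay as even as possible. The function name, the greedy longest-processing-time-first heuristic, and the data layout are assumptions for illustration, not MicroEP's actual method.

```python
# Hypothetical sketch (not MicroEP's algorithm): balance per-micro-batch token
# load across GPUs with a greedy longest-processing-time-first placement.
import heapq
from collections import Counter

def schedule_tokens(expert_ids, num_gpus):
    """expert_ids: list with one expert index per token in the micro-batch.
    Returns {gpu_id: [expert_id, ...]} assigning each expert's tokens to a GPU."""
    load_per_expert = Counter(expert_ids)
    # Min-heap of (current token load, gpu id); heaviest experts go first,
    # each onto the currently least-loaded GPU.
    heap = [(0, gpu) for gpu in range(num_gpus)]
    heapq.heapify(heap)
    placement = {gpu: [] for gpu in range(num_gpus)}
    for expert, n_tokens in sorted(load_per_expert.items(), key=lambda kv: -kv[1]):
        load, gpu = heapq.heappop(heap)
        placement[gpu].append(expert)
        heapq.heappush(heap, (load + n_tokens, gpu))
    return placement

if __name__ == "__main__":
    # Skewed routing: expert 0 receives most of the tokens in this micro-batch.
    tokens = [0] * 60 + [1] * 20 + [2] * 15 + [3] * 5
    print(schedule_tokens(tokens, num_gpus=2))
```

In a real system the placement decision would also have to account for the communication cost of moving tokens between GPUs, which is part of what makes fine-grained, per-micro-batch balancing nontrivial.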
Similar Papers
MemFine: Memory-Aware Fine-Grained Scheduling for MoE Training
Distributed, Parallel, and Cluster Computing
Trains big AI models on less computer memory.
Efficient MoE Serving in the Memory-Bound Regime: Balance Activated Experts, Not Tokens
Distributed, Parallel, and Cluster Computing
Makes AI models run faster by sharing work better.
ElasticMoE: An Efficient Auto Scaling Method for Mixture-of-Experts Models
Distributed, Parallel, and Cluster Computing
Lets big AI models grow and shrink instantly.