Towards Principled Design of Mixture-of-Experts Language Models under Memory and Inference Constraints
By: Seng Pei Liew, Kenta Shinzato, Yuyang Dong
Potential Business Impact:
Shows how to design smarter AI models within memory and speed limits.
Modern Mixture-of-Experts (MoE) language models are typically designed around two quantities: total parameters (memory footprint) and active parameters (inference cost). However, we find that these two factors alone are insufficient to specify an optimal architecture. Through a systematic study, we demonstrate that MoE performance is primarily determined by the total parameter count ($N_{total}$) and the expert sparsity $s := n_{exp}/n_{topk}$, where $n_{exp}$ is the total number of experts and $n_{topk}$ is the number of experts activated per token. Moreover, $n_{exp}$ and $n_{topk}$ do not simply "cancel out" within the sparsity ratio; instead, a larger total number of experts slightly penalizes performance by forcing a reduction in core model dimensions (depth and width) to stay within the memory budget. This motivates a simple design principle: maximize $N_{total}$ while minimizing both $s$ (equivalently, maximizing $n_{topk}$) and $n_{exp}$ under the given constraints. Our findings provide a robust framework for resolving architectural ambiguity and guiding MoE design.
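To make the selection rule concrete, the minimal sketch below (not the authors' code) enumerates candidate MoE configurations under a memory budget (total parameters) and an inference budget (active parameters), then picks the one that maximizes $N_{total}$ while minimizing $s$ and, secondarily, $n_{exp}$. The parameter-count approximations, the candidate grid, and the budget values are illustrative assumptions, not details from the paper.

```python
# Sketch of the stated design principle: among MoE configurations that fit a
# memory budget (total parameters) and an inference budget (active parameters),
# prefer the one that maximizes N_total while minimizing sparsity
# s = n_exp / n_topk and, secondarily, n_exp.
# The parameter formulas, candidate grid, and budgets are assumptions.
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class MoEConfig:
    d_model: int   # hidden width
    n_layers: int  # depth
    n_exp: int     # total experts per MoE layer
    n_topk: int    # experts activated per token

    @property
    def d_ff(self) -> int:
        # Assumed expert FFN width (a common choice is ~4x the hidden size).
        return 4 * self.d_model

    @property
    def n_total(self) -> int:
        # Rough per-layer count: attention (~4*d^2) plus n_exp expert FFNs
        # (~2*d*d_ff each); embeddings and norms are ignored for brevity.
        per_layer = 4 * self.d_model**2 + self.n_exp * 2 * self.d_model * self.d_ff
        return self.n_layers * per_layer

    @property
    def n_active(self) -> int:
        # Same count, but only the n_topk routed experts fire per token.
        per_layer = 4 * self.d_model**2 + self.n_topk * 2 * self.d_model * self.d_ff
        return self.n_layers * per_layer

    @property
    def sparsity(self) -> float:
        return self.n_exp / self.n_topk


def best_config(memory_budget: int, inference_budget: int) -> MoEConfig | None:
    """Pick the feasible config maximizing N_total, then minimizing s and n_exp."""
    candidates = [
        MoEConfig(d, l, e, k)
        for d, l, e, k in product(
            (1024, 2048, 4096),  # candidate widths
            (16, 24, 32),        # candidate depths
            (8, 16, 32, 64),     # candidate total expert counts
            (1, 2, 4, 8),        # candidate active expert counts
        )
        if k <= e
    ]
    feasible = [
        c for c in candidates
        if c.n_total <= memory_budget and c.n_active <= inference_budget
    ]
    if not feasible:
        return None
    # The principle: larger N_total first, then lower sparsity, then fewer experts.
    return min(feasible, key=lambda c: (-c.n_total, c.sparsity, c.n_exp))


if __name__ == "__main__":
    cfg = best_config(memory_budget=30_000_000_000, inference_budget=3_000_000_000)
    if cfg is not None:
        print(cfg, f"N_total={cfg.n_total:,}", f"s={cfg.sparsity:.1f}")
```

The tie-breaking order in `best_config` mirrors the abstract's priority: the memory budget is spent on $N_{total}$ first, and only then is sparsity reduced and the expert count kept small.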
Similar Papers
Optimal Sparsity of Mixture-of-Experts Language Models for Reasoning Tasks
Machine Learning (CS)
Makes AI better at thinking, not just remembering.
Joint MoE Scaling Laws: Mixture of Experts Can Be Memory Efficient
Machine Learning (CS)
Makes smart computer programs use less memory.
Faster MoE LLM Inference for Extremely Large Models
Computation and Language
Makes AI faster by using fewer parts.