Grove MoE: Towards Efficient and Superior MoE LLMs with Adjugate Experts
By: Haoyuan Wu, Haoxing Chen, Xiaodong Chen, and more
Potential Business Impact:
Makes AI smarter by using different-sized "brains."
The Mixture of Experts (MoE) architecture is a cornerstone of modern state-of-the-art (SOTA) large language models (LLMs). MoE models facilitate scalability by enabling sparse parameter activation. However, traditional MoE architectures use homogeneous experts of uniform size, activating a fixed number of parameters regardless of input complexity and thus limiting computational efficiency. To overcome this limitation, we introduce Grove MoE, a novel architecture incorporating experts of varying sizes, inspired by the heterogeneous big.LITTLE CPU architecture. This architecture features novel adjugate experts with a dynamic activation mechanism, enabling model capacity expansion while keeping computational overhead manageable. Building on this architecture, we present GroveMoE-Base and GroveMoE-Inst, 33B-parameter LLMs developed by applying an upcycling strategy to the Qwen3-30B-A3B-Base model during mid-training and post-training. GroveMoE models dynamically activate 3.14-3.28B parameters based on token complexity and achieve performance comparable to SOTA open-source models of similar or even larger size.
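To make the idea of size-heterogeneous experts concrete, here is a minimal, hypothetical sketch of an MoE layer whose experts have different hidden widths, so the parameter count activated per token depends on which experts the router selects. This is an illustration of the general principle only, not the paper's adjugate-expert mechanism or the GroveMoE implementation; the class name `HeterogeneousMoE` and the `expert_hidden_sizes` parameter are invented for this example.

```python
# Illustrative sketch only (assumed design, not the paper's code): a toy MoE
# layer with experts of different hidden sizes, so tokens routed to "big"
# experts activate more parameters than tokens routed to "little" ones.
import torch
import torch.nn as nn
import torch.nn.functional as F


class HeterogeneousMoE(nn.Module):
    def __init__(self, d_model=256, expert_hidden_sizes=(128, 256, 512, 1024), top_k=2):
        super().__init__()
        self.top_k = top_k
        # Experts of varying capacity, loosely analogous to big.LITTLE CPU cores.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, h), nn.GELU(), nn.Linear(h, d_model))
            for h in expert_hidden_sizes
        )
        self.router = nn.Linear(d_model, len(self.experts))

    def forward(self, x):  # x: (num_tokens, d_model)
        gates = F.softmax(self.router(x), dim=-1)        # per-token routing probabilities
        weights, idx = gates.topk(self.top_k, dim=-1)    # choose top-k experts per token
        weights = weights / weights.sum(dim=-1, keepdim=True)
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = (idx == e)                            # which tokens picked expert e
            if mask.any():
                token_ids, slot = mask.nonzero(as_tuple=True)
                out[token_ids] += weights[token_ids, slot].unsqueeze(-1) * expert(x[token_ids])
        return out


if __name__ == "__main__":
    layer = HeterogeneousMoE()
    tokens = torch.randn(8, 256)
    print(layer(tokens).shape)  # torch.Size([8, 256])
```

In this toy setup the per-token compute varies only with the sizes of the selected experts; GroveMoE's dynamic activation of adjugate experts is a more involved mechanism described in the paper itself.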
Similar Papers
ReXMoE: Reusing Experts with Minimal Overhead in Mixture-of-Experts
Computation and Language
Makes smart computer programs learn better and faster.
Breaking the MoE LLM Trilemma: Dynamic Expert Clustering with Structured Compression
Computation and Language
Makes AI smarter, faster, and use less memory.
Orders in Chaos: Enhancing Large-Scale MoE LLM Serving with Data Movement Forecasting
Distributed, Parallel, and Cluster Computing
Makes AI models run much faster and smoother.