Variational Inference, Entropy, and Orthogonality: A Unified Theory of Mixture-of-Experts
By: Ye Su, Yong Liu
Potential Business Impact:
Makes AI smarter by choosing the right brain parts.
Mixture-of-Experts (MoE) models enable large language models to scale efficiently by activating only a subset of experts for each input. However, their core mechanisms, Top-k routing and auxiliary load balancing, remain heuristic, lacking a cohesive theoretical underpinning. To this end, we build the first unified theoretical framework that rigorously derives these practices as optimal sparse posterior approximation and prior regularization from a Bayesian perspective, while simultaneously framing them as mechanisms that minimize routing ambiguity and maximize channel capacity from an information-theoretic perspective. We also pinpoint the inherent combinatorial hardness of routing, formalizing it as an NP-hard sparse subset selection problem. We rigorously prove the existence of a "Coherence Barrier": when expert representations exhibit high mutual coherence, greedy routing strategies provably fail to recover the optimal expert subset. Importantly, we formally verify that imposing geometric orthogonality on the expert feature space is sufficient to narrow the gap between the NP-hard global optimum and a polynomial-time greedy approximation. Our comparative analyses confirm orthogonality regularization as the optimal engineering relaxation for large-scale models. Our work offers essential theoretical support and technical assurance for a deeper understanding and novel designs of MoE.
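The quantities named in the abstract can be sketched concretely. Below is a minimal, illustrative NumPy example (not code from the paper; all variable names, shapes, and the specific penalty form are assumptions for demonstration): mutual coherence of a set of expert representations, which drives the "Coherence Barrier"; Top-k routing, which greedily picks the highest-scoring experts; and a Frobenius-norm orthogonality penalty of the kind an orthogonality regularizer would minimize.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_experts, k = 16, 8, 2

# Expert representations, one per row, unit-normalized.
W = rng.normal(size=(n_experts, d))
W /= np.linalg.norm(W, axis=1, keepdims=True)

# Mutual coherence: the largest |cosine similarity| between two
# distinct experts. High coherence is what makes greedy routing fail.
G = W @ W.T
coherence = np.max(np.abs(G - np.eye(n_experts)))

# Top-k routing: score a token against all experts, keep the best k.
x = rng.normal(size=d)
scores = W @ x
topk = np.argsort(scores)[-k:]

# Orthogonality penalty: deviation of the Gram matrix from identity.
# Driving this to zero drives the mutual coherence to zero as well.
ortho_penalty = np.linalg.norm(G - np.eye(n_experts), "fro") ** 2

print(coherence, sorted(topk.tolist()), ortho_penalty)
```

For orthonormal expert rows the Gram matrix equals the identity, so both the coherence and the penalty vanish, which is the geometric condition the paper argues closes the gap between greedy routing and the NP-hard optimum.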
Similar Papers
Geometric Regularization in Mixture-of-Experts: The Disconnect Between Weights and Activations
Machine Learning (CS)
Makes AI models learn better by making them unique.
Routing by Analogy: kNN-Augmented Expert Assignment for Mixture-of-Experts
Computation and Language
AI learns better by remembering past answers.
Bandwidth-Efficient Adaptive Mixture-of-Experts via Low-Rank Compensation
Machine Learning (CS)
Makes AI faster and use less computer memory.