Geometric Regularization in Mixture-of-Experts: The Disconnect Between Weights and Activations
By: Hyunjun Kim
Potential Business Impact:
Shows that forcing AI experts to be different from each other does not reliably help models learn.
Mixture-of-Experts (MoE) models achieve efficiency through sparse activation, but the role of geometric regularization in expert specialization remains unclear. We apply an orthogonality loss to enforce expert diversity and find that it fails on multiple fronts: it does not reduce weight-space overlap (MSO actually increases by up to 114%), activation-space overlap remains high (~0.6) regardless of regularization strength, and its effects on performance are inconsistent: a marginal improvement on WikiText-103 (-0.9%), a slight degradation on TinyStories (+0.9%), and highly variable results on PTB (std > 1.0). Our analysis across 7 regularization strengths reveals no significant correlation (r = -0.293, p = 0.523) between weight and activation orthogonality. These findings demonstrate that weight-space regularization neither achieves its geometric goal nor reliably improves performance, making it unsuitable for enforcing expert diversity in MoE models.
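The sketch below illustrates the kind of machinery the abstract refers to: a pairwise orthogonality penalty over expert weight matrices and a mean pairwise overlap metric that can be computed on either weights or activations. The paper does not spell out its exact loss or its MSO definition, so the specific formulas, function names, and the PyTorch framing here are assumptions for illustration, not the authors' implementation.

```python
import torch


def orthogonality_loss(expert_weights: torch.Tensor) -> torch.Tensor:
    """Penalize pairwise alignment between expert weight matrices.

    expert_weights: (num_experts, d_out, d_in). Each expert's weights are
    flattened and L2-normalized, then off-diagonal cosine similarities of the
    Gram matrix are squared and averaged. This is one common form of
    orthogonality regularization, assumed here rather than taken from the paper.
    """
    num_experts = expert_weights.shape[0]
    flat = expert_weights.reshape(num_experts, -1)
    flat = flat / flat.norm(dim=1, keepdim=True).clamp_min(1e-8)
    gram = flat @ flat.T  # (E, E) cosine similarities between experts
    off_diag = gram - torch.eye(num_experts, device=gram.device)
    return (off_diag ** 2).sum() / (num_experts * (num_experts - 1))


def mean_pairwise_overlap(vectors: torch.Tensor) -> torch.Tensor:
    """Mean absolute cosine similarity across all expert pairs.

    Applied to flattened expert weights this gives a weight-space overlap
    score; applied to per-expert mean activations it gives an
    activation-space overlap (one plausible reading of the ~0.6 figure
    reported in the abstract).
    """
    num = vectors.shape[0]
    flat = vectors.reshape(num, -1)
    flat = flat / flat.norm(dim=1, keepdim=True).clamp_min(1e-8)
    cos = flat @ flat.T
    mask = ~torch.eye(num, dtype=torch.bool, device=cos.device)
    return cos[mask].abs().mean()


# Hypothetical usage: add lambda_ortho * orthogonality_loss(W) to the
# language-modeling loss, then track mean_pairwise_overlap on weights and on
# expert activations to check whether the regularizer moves either quantity.
```

Tracking the two overlap scores separately is what exposes the disconnect the paper reports: the penalty can fail to shrink weight-space overlap while activation-space overlap stays essentially flat.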
Similar Papers
Variational Inference, Entropy, and Orthogonality: A Unified Theory of Mixture-of-Experts
Machine Learning (CS)
Makes AI smarter by choosing the right brain parts.
Advancing Expert Specialization for Better MoE
Computation and Language
Makes AI smarter by teaching experts to focus.
Mixture of Group Experts for Learning Invariant Representations
Machine Learning (CS)
Makes AI smarter by teaching experts to work together.