On the Convergence of Muon and Beyond
By: Da Chang, Yongxiang Liu, Ganzhao Yuan
Potential Business Impact:
Faster, more reliable training of AI models, backed by theoretical guarantees.
The Muon optimizer has demonstrated remarkable empirical success in handling matrix-structured parameters for training neural networks. However, a significant gap persists between its practical performance and theoretical understanding. Existing analyses indicate that the standard Muon variant achieves only a suboptimal convergence rate of $\mathcal{O}(T^{-1/4})$ in stochastic non-convex settings, where $T$ denotes the number of iterations. To explore the theoretical limits of the Muon framework, we develop and analyze two momentum-based variance-reduced variants: a one-batch version (Muon-MVR1) and a two-batch version (Muon-MVR2). We provide the first rigorous proof that incorporating a variance-reduction mechanism enables Muon-MVR2 to attain an optimal convergence rate of $\tilde{\mathcal{O}}(T^{-1/3})$, thereby matching the theoretical lower bound for this class of problems. Moreover, our analysis establishes convergence guarantees for Muon variants under the Polyak-{\L}ojasiewicz (P{\L}) condition. Extensive experiments on vision (CIFAR-10) and language (C4) benchmarks corroborate our theoretical findings on per-iteration convergence. Overall, this work provides the first proof of optimality for a Muon-style optimizer and clarifies the path toward developing more practically efficient, accelerated variants.
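To make the setup concrete, below is a minimal sketch of what a Muon-style step with momentum-based variance reduction (MVR) could look like. It combines the publicly known Newton-Schulz orthogonalization used by Muon with a STORM-style variance-reduced momentum in the spirit of the one-batch variant; the function names (`muon_mvr_step`, `grad_fn`) and hyperparameter values are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def newton_schulz_orthogonalize(M, steps=5, eps=1e-7):
    """Approximately orthogonalize M (its polar factor) with a Newton-Schulz-style
    iteration, the post-processing step Muon applies to its momentum matrix."""
    X = M / (np.linalg.norm(M) + eps)           # scale so the iteration converges
    a, b, c = 3.4445, -4.7750, 2.0315           # coefficients from public Muon implementations
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X

def muon_mvr_step(W, W_prev, d_prev, grad_fn, batch, beta=0.9, lr=0.02):
    """One sketch iteration: Muon's orthogonalized update applied to a
    STORM-style momentum-based variance-reduced (MVR) direction.
    grad_fn(W, batch) -> stochastic gradient matrix; all names are illustrative."""
    g_t = grad_fn(W, batch)
    if d_prev is None:
        d_t = g_t                               # first step: plain stochastic gradient
    else:
        # MVR correction: re-evaluate the gradient at the previous iterate on the
        # same batch (one-batch style) and damp the accumulated drift.
        d_t = g_t + (1.0 - beta) * (d_prev - grad_fn(W_prev, batch))
    W_new = W - lr * newton_schulz_orthogonalize(d_t)
    return W_new, d_t
```

A two-batch variant in the spirit of Muon-MVR2 would evaluate the correction term on a second, independent batch; the paper's analysis, not this sketch, establishes the $\tilde{\mathcal{O}}(T^{-1/3})$ rate for that version.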
Similar Papers
MARS-M: When Variance Reduction Meets Matrices
Machine Learning (CS)
Makes AI learn much faster and better.
LiMuon: Light and Fast Muon Optimizer for Large Models
Machine Learning (CS)
Makes AI models train faster with less memory.