Muon is Provably Faster with Momentum Variance Reduction
By: Xun Qian, Hussein Rammal, Dmitry Kovalev, and more
Recent empirical research has demonstrated that deep learning optimizers based on the linear minimization oracle (LMO) over specifically chosen non-Euclidean norm balls, such as Muon and Scion, outperform Adam-type methods in the training of large language models. In this work, we show that such optimizers can be provably improved by replacing their vanilla momentum with momentum variance reduction (MVR). Instead of proposing and analyzing MVR variants of Muon and Scion separately, we incorporate MVR into the recently proposed Gluon framework, which recovers Muon, Scion, and other non-Euclidean LMO-based methods as special cases, and which works under a more general smoothness assumption that better captures the layer-wise structure of neural networks. In the non-convex case, we incorporate MVR into Gluon in three different ways; all three improve the convergence rate from ${\cal O} (\frac{1}{K^{1/4}})$ to ${\cal O} (\frac{1}{K^{1/3}})$. Additionally, we provide improved rates in the star-convex case. Finally, we conduct several numerical experiments that confirm the superior performance of the proposed algorithms in terms of iteration complexity.
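To make the mechanics concrete, below is a minimal NumPy sketch of a single MVR-plus-LMO iteration. It is an illustration under stated assumptions, not the paper's pseudocode: the momentum update follows the standard STORM/MVR recursion, the LMO is taken over a spectral-norm ball as in Muon, and the function names (`lmo_spectral`, `mvr_lmo_step`), the radius, the step size, and the random data are hypothetical choices made only for this example.

```python
import numpy as np

def lmo_spectral(g, radius=1.0):
    """LMO over a spectral-norm ball of the given radius (the Muon-style choice).

    Returns argmin_{||S||_2 <= radius} <g, S> = -radius * U @ Vt,
    where g = U diag(s) Vt is the thin SVD of the momentum matrix.
    (Muon approximates U @ Vt with Newton-Schulz iterations; exact SVD is used
    here purely for clarity.)
    """
    u, _, vt = np.linalg.svd(g, full_matrices=False)
    return -radius * (u @ vt)

def mvr_lmo_step(x, m_prev, grad, grad_prev, alpha, step_size):
    """One schematic iteration: MVR (STORM-style) momentum followed by an LMO step.

    m_k     = g(x_k; xi_k) + (1 - alpha) * (m_{k-1} - g(x_{k-1}; xi_k))
    x_{k+1} = x_k + step_size * lmo(m_k)

    `grad` and `grad_prev` must be evaluated on the same sample xi_k at the
    current and previous iterates, respectively.
    """
    m = grad + (1.0 - alpha) * (m_prev - grad_prev)
    return x + step_size * lmo_spectral(m), m

# One illustrative iteration on random data.
rng = np.random.default_rng(0)
x = rng.standard_normal((8, 4))        # current weight matrix x_k
g_now = rng.standard_normal((8, 4))    # stochastic gradient at x_k (sample xi_k)
g_prev = rng.standard_normal((8, 4))   # stochastic gradient at x_{k-1} (same xi_k)
m_prev = np.zeros_like(x)              # previous momentum m_{k-1}
x_next, m_next = mvr_lmo_step(x, m_prev, g_now, g_prev, alpha=0.1, step_size=0.02)
```

The only difference from vanilla heavy-ball momentum in this sketch is the correction term $(1-\alpha)(m_{k-1} - g(x_{k-1};\xi_k))$, which is the variance-reduction ingredient the abstract refers to; the paper's actual algorithms and analysis are carried out in the more general Gluon framework.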
Similar Papers
On the Convergence of Muon and Beyond (Machine Learning, CS)
Better LMO-based Momentum Methods with Second-Order Information (Optimization and Control)
An Exploration of Non-Euclidean Gradient Descent: Muon and its Many Variants (Machine Learning, CS)