Low-rank Orthogonalization for Large-scale Matrix Optimization with Applications to Foundation Model Training
By: Chuan He, Zhanwang Deng, Zhaosong Lu
Potential Business Impact:
Enables faster, better pretraining of large language models such as GPT-2 and LLaMA by exploiting the low-rank structure of gradients during training.
Neural network (NN) training is inherently a large-scale matrix optimization problem, yet the matrix structure of NN parameters has long been overlooked. Recently, the optimizer Muon (Jordan et al.), which explicitly exploits this structure, has gained significant attention for its strong performance in foundation model training. A key component contributing to Muon's success is matrix orthogonalization. In this paper, we propose low-rank orthogonalization, which explicitly leverages the low-rank nature of gradients during NN training. Building on this, we propose low-rank matrix-signed gradient descent and a low-rank variant of Muon. Our numerical experiments demonstrate the superior performance of low-rank orthogonalization, with low-rank Muon achieving promising results in GPT-2 and LLaMA pretraining -- surpassing carefully tuned vanilla Muon. Theoretically, we establish the iteration complexity of low-rank matrix-signed gradient descent for finding an approximate stationary solution, as well as that of low-rank Muon for finding an approximate stochastic stationary solution under heavy-tailed noise.
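To make the idea concrete, here is a minimal, hypothetical Python/PyTorch sketch of "orthogonalize only a low-rank piece of the gradient": it combines a randomized range finder with Muon-style Newton-Schulz iterations applied to the small projected factor. The function names (newton_schulz_msign, low_rank_orthogonalize), the rank parameter, and the sketching choice are illustrative assumptions, not the paper's actual algorithm or code.

```python
import torch

def newton_schulz_msign(M, steps=5, eps=1e-7):
    # Approximate the matrix sign / polar factor of M via a quintic
    # Newton-Schulz iteration (the orthogonalization step popularized by Muon).
    transpose = M.shape[0] > M.shape[1]
    X = M.T if transpose else M
    X = X / (X.norm() + eps)            # Frobenius normalization so the iteration converges
    a, b, c = 3.4445, -4.7750, 2.0315   # quintic coefficients used in Muon implementations
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transpose else X

def low_rank_orthogonalize(grad, rank=32, steps=5):
    # Hypothetical low-rank orthogonalization sketch:
    # 1) build an orthonormal basis Q for a rank-`rank` sketch of the gradient's column space,
    # 2) orthogonalize only the small (rank x n) projected factor,
    # 3) lift the result back to the full (m x n) shape.
    m, n = grad.shape
    omega = torch.randn(n, rank, device=grad.device, dtype=grad.dtype)
    Q, _ = torch.linalg.qr(grad @ omega)   # m x rank orthonormal basis (randomized range finder)
    small = Q.T @ grad                     # rank x n projected gradient, cheap to orthogonalize
    return Q @ newton_schulz_msign(small, steps=steps)

# Toy usage on a single weight matrix's gradient:
G = torch.randn(1024, 4096)                    # stands in for a layer's gradient
update = low_rank_orthogonalize(G, rank=64)    # same shape as G; polar factor of a rank-64 sketch
```

Because Q has orthonormal columns, Q @ msign(Q^T G) equals the polar factor of the rank-r projection Q Q^T G, so only a rank x n matrix is ever orthogonalized; this is the kind of cost saving a low-rank variant of Muon can exploit.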
Similar Papers
Turbo-Muon: Accelerating Orthogonality-Based Optimization with Pre-Conditioning
Artificial Intelligence
Speeds up orthogonality-based optimizers such as Muon by adding a pre-conditioning step.
ROOT: Robust Orthogonalized Optimizer for Neural Network Training
Machine Learning (CS)
An orthogonalized optimizer aimed at keeping neural network training robust, even with noisy data.
Iterative Orthogonalization Scaling Laws
Machine Learning (CS)
Studies how iterative orthogonalization scales, toward faster and more stable training.