Dynamic Low-rank Approximation of Full-Matrix Preconditioner for Training Generalized Linear Models
By: Tatyana Matveeva, Aleksandr Katrutsa, Evgeny Frolov
Potential Business Impact:
Makes training machine learning models faster with smarter matrix math.
Adaptive gradient methods such as Adagrad and its variants are widespread in large-scale optimization. However, their use of diagonal preconditioning matrices limits their ability to capture correlations between parameters. Full-matrix adaptive methods, which approximate the exact Hessian, can model these correlations and may enable faster convergence. At the same time, their computational and memory costs are often prohibitive for large-scale models. To address this limitation, we propose AdaGram, an optimizer that enables efficient full-matrix adaptive gradient updates. To reduce memory and computational overhead, we use fast symmetric factorization to compute the preconditioned update direction at each iteration. Additionally, we maintain the low-rank structure of the preconditioner along the optimization trajectory using matrix integrator methods. Numerical experiments on standard machine learning tasks show that AdaGram converges faster than or matches diagonal adaptive optimizers when using preconditioner approximations of rank five or lower. This demonstrates AdaGram's potential as a scalable solution for adaptive optimization in large models.
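To make the idea concrete, below is a minimal, hypothetical sketch of a low-rank full-matrix adaptive gradient step, loosely inspired by the abstract. It is not the authors' AdaGram algorithm: the factor update uses a plain truncated SVD as a stand-in for the paper's matrix integrator, and all names and parameters (lowrank_adagrad_step, rank, eps) are illustrative assumptions. It only shows how a preconditioner G ≈ U Uᵀ with a small rank lets one apply (G + εI)^{-1/2} to a gradient without ever forming a d×d matrix.

```python
# Illustrative sketch only: low-rank full-matrix adaptive step (not the paper's AdaGram).
import numpy as np

def lowrank_adagrad_step(x, g, U, lr=0.1, eps=1e-4, rank=5):
    """One update with preconditioner G ~ U @ U.T, where U is d x rank.

    Returns the new parameters and the updated low-rank factor.
    """
    # Fold the new gradient into the factor and re-truncate to the target rank.
    # (Simple stand-in for the matrix-integrator update of the preconditioner.)
    augmented = np.column_stack([U, g])
    Q, s, _ = np.linalg.svd(augmented, full_matrices=False)
    Q, s = Q[:, :rank], s[:rank]
    U_new = Q * s  # columns scaled by singular values, so U_new @ U_new.T ~ G

    # Apply (G + eps*I)^{-1/2} g without forming the d x d matrix, using
    # (Q diag(s^2) Q^T + eps I)^{-1/2}
    #   = eps^{-1/2} I + Q diag((s^2 + eps)^{-1/2} - eps^{-1/2}) Q^T.
    coeff = 1.0 / np.sqrt(s**2 + eps) - 1.0 / np.sqrt(eps)
    direction = g / np.sqrt(eps) + Q @ (coeff * (Q.T @ g))

    return x - lr * direction, U_new

# Tiny usage example on a least-squares objective.
rng = np.random.default_rng(0)
A, b = rng.standard_normal((50, 10)), rng.standard_normal(50)
x, U = np.zeros(10), np.zeros((10, 5))
for _ in range(200):
    grad = A.T @ (A @ x - b)
    x, U = lowrank_adagrad_step(x, grad, U)
```

The point of the sketch is the cost model: each step touches only a d×(rank+1) SVD and a few matrix-vector products, so memory and compute stay linear in the number of parameters for a fixed small rank, which is the regime the abstract targets with rank-five and smaller approximations.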
Similar Papers
Efficient Low-Tubal-Rank Tensor Estimation via Alternating Preconditioned Gradient Descent
Machine Learning (CS)
Makes computer math problems solve much faster.
Adam or Gauss-Newton? A Comparative Study In Terms of Basis Alignment and SGD Noise
Machine Learning (CS)
Makes computer learning faster and more accurate.
Preconditioned Gradient Descent for Over-Parameterized Nonconvex Matrix Factorization
Optimization and Control
Makes computer learning faster when it's too complex.