On the $O(\frac{\sqrt{d}}{K^{1/4}})$ Convergence Rate of AdamW Measured by $\ell_1$ Norm
By: Huan Li, Yiming Dong, Zhouchen Lin
Potential Business Impact:
Explains why AdamW, the standard method for training large language models, reliably learns, which supports faster and more dependable large-scale AI training.
As the default optimizer for training large language models, AdamW has achieved remarkable success in deep learning. However, its convergence behavior is not well understood theoretically. This paper establishes the convergence rate $\frac{1}{K}\sum_{k=1}^K E\left[||\nabla f(x^k)||_1\right]\leq O(\frac{\sqrt{d}C}{K^{1/4}})$ for AdamW measured in the $\ell_1$ norm, where $K$ is the number of iterations, $d$ is the model dimension, and $C$ matches the constant in the optimal convergence rate of SGD. Theoretically, we have $||\nabla f(x)||_2\ll ||\nabla f(x)||_1\leq \sqrt{d}\,||\nabla f(x)||_2$ for any high-dimensional vector $x$, and $E\left[||\nabla f(x)||_1\right]\geq\sqrt{\frac{2d}{\pi}}\,E\left[||\nabla f(x)||_2\right]$ when each element of $\nabla f(x)$ is drawn from the Gaussian distribution $\mathcal N(0,1)$. Empirically, our experimental results on real-world deep learning tasks reveal $||\nabla f(x)||_1=\varTheta(\sqrt{d})\,||\nabla f(x)||_2$. Both support that our convergence rate can be considered analogous to the optimal $\frac{1}{K}\sum_{k=1}^K E\left[||\nabla f(x^k)||_2\right]\leq O(\frac{C}{K^{1/4}})$ convergence rate of SGD.
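The comparison above rests on two facts about the gap between the $\ell_1$ and $\ell_2$ norms: the deterministic chain $||g||_2\leq ||g||_1\leq \sqrt{d}\,||g||_2$ and the expectation bound with constant $\sqrt{2d/\pi}$ for Gaussian entries. The following is a minimal NumPy sketch, not from the paper, that checks both numerically; the dimension and trial count are arbitrary illustrative choices.

```python
# Minimal numerical sketch (not from the paper): checks the norm relations
# quoted in the abstract for vectors with i.i.d. N(0,1) entries.
import numpy as np

rng = np.random.default_rng(0)
d = 100_000      # assumed dimension, chosen only for illustration
trials = 100     # number of random vectors to sample

l1 = np.empty(trials)
l2 = np.empty(trials)
for t in range(trials):
    g = rng.standard_normal(d)      # stand-in for a gradient with N(0,1) entries
    l1[t] = np.abs(g).sum()         # ||g||_1
    l2[t] = np.sqrt((g * g).sum())  # ||g||_2

# ||g||_2 <= ||g||_1 <= sqrt(d) * ||g||_2 holds for every sample.
assert np.all(l2 <= l1)
assert np.all(l1 <= np.sqrt(d) * l2 + 1e-8)

# The empirical ratio E||g||_1 / E||g||_2 should be close to sqrt(2d/pi),
# i.e. the l1 norm grows like Theta(sqrt(d)) times the l2 norm.
print("empirical ratio:", l1.mean() / l2.mean())
print("sqrt(2d/pi):   ", np.sqrt(2 * d / np.pi))
```

For $d=100{,}000$ the two printed numbers agree to within sampling noise (both near 252), illustrating the $\varTheta(\sqrt{d})$ relationship the paper also observes on real deep learning gradients.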
Similar Papers
Convergence Rate Analysis of the AdamW-Style Shampoo: Unifying One-sided and Two-Sided Preconditioning
Optimization and Control
Shows when the Shampoo optimizer, run with AdamW-style preconditioning, is guaranteed to converge.
Sharp higher order convergence rates for the Adam optimizer
Optimization and Control
Shows how quickly the Adam optimizer is guaranteed to converge.
Simple Convergence Proof of Adam From a Sign-like Descent Perspective
Machine Learning (CS)
Offers a simpler way to prove that the Adam optimizer converges.