Score: 0

Convergence Rate Analysis of the AdamW-Style Shampoo: Unifying One-sided and Two-Sided Preconditioning

Published: January 12, 2026 | arXiv ID: 2601.07326v1

By: Huan Li, Yiming Dong, Zhouchen Lin

Potential Business Impact:

Makes computer learning faster and better.

Business Areas:
A/B Testing Data and Analytics

This paper studies the AdamW-style Shampoo optimizer, an effective implementation of classical Shampoo that notably won the external tuning track of the AlgoPerf neural network training algorithm competition. Our analysis unifies one-sided and two-sided preconditioning and establishes the convergence rate $\frac{1}{K}\sum_{k=1}^K E\left[\|\nabla f(X_k)\|_*\right]\leq O(\frac{\sqrt{m+n}C}{K^{1/4}})$ measured by nuclear norm, where $K$ represents the iteration number, $(m,n)$ denotes the size of matrix parameters, and $C$ matches the constant in the optimal convergence rate of SGD. Theoretically, we have $\|\nabla f(X)\|_F\leq \|\nabla f(X)\|_*\leq \sqrt{m+n}\|\nabla f(X)\|_F$, supporting that our convergence rate can be considered to be analogous to the optimal $\frac{1}{K}\sum_{k=1}^KE\left[\|\nabla f(X_k)\|_F\right]\leq O(\frac{C}{K^{1/4}})$ convergence rate of SGD in the ideal case of $\|\nabla f(X)\|_*= Θ(\sqrt{m+n})\|\nabla f(X)\|_F$.

Country of Origin
🇨🇳 China

Page Count
18 pages

Category
Mathematics:
Optimization and Control