A Convergence Analysis of Adaptive Optimizers under Floating-point Quantization
By: Xuan Tang, Jichu Li, Difan Zou
Potential Business Impact:
Explains why large AI models can be trained in low-precision arithmetic, cutting memory use without slowing down learning.
The rapid scaling of large language models (LLMs) has made low-precision training essential for reducing memory, improving efficiency, and enabling larger models and datasets. Existing convergence theories for adaptive optimizers, however, assume all components are exact and neglect hardware-aware quantization, leaving open the question of why low-precision training remains effective. We introduce the first theoretical framework for analyzing the convergence of adaptive optimizers, including Adam and Muon, under floating-point quantization of gradients, weights, and optimizer states (e.g., moment estimates). Within this framework, we derive convergence rates on smooth non-convex objectives under standard stochastic gradient assumptions, explicitly characterizing how quantization errors from different components affect convergence. We show that both algorithms retain rates close to their full-precision counterparts provided the mantissa length scales only logarithmically with the number of iterations. Our analysis further reveals that Adam is highly sensitive to weight and second-moment quantization due to its reliance on $\beta_2 \to 1$, while Muon requires weaker error control and is thus potentially more robust. These results narrow the gap between empirical success and theoretical understanding of low-precision training methods. Numerical experiments on synthetic and real-world data corroborate our theory.
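To make the quantization model concrete, here is a minimal sketch (our illustration, not the authors' code) of round-to-nearest floating-point quantization with a configurable mantissa length, applied to the gradients, weights, and both Adam moment estimates as the abstract describes. The function names (`fp_quantize`, `quantized_adam_step`), the idealized unbounded exponent, and the noisy least-squares test problem are all assumptions made for this sketch.

```python
import numpy as np

def fp_quantize(x, mantissa_bits):
    """Round each entry to the nearest floating-point value with
    `mantissa_bits` mantissa bits (idealized: exponent range unbounded)."""
    x = np.asarray(x, dtype=np.float64)
    out = np.zeros_like(x)
    nz = x != 0
    exp = np.floor(np.log2(np.abs(x[nz])))   # per-entry binary exponent
    ulp = 2.0 ** (exp - mantissa_bits)       # spacing of representable values
    out[nz] = np.round(x[nz] / ulp) * ulp    # round to the nearest grid point
    return out

def quantized_adam_step(w, m, v, grad, lr=1e-2, beta1=0.9,
                        beta2=0.999, eps=1e-8, bits=10):
    """One Adam step (bias correction omitted for brevity) in which the
    gradient, both moments, and the weights are all stored with a
    `bits`-bit mantissa."""
    g = fp_quantize(grad, bits)                             # quantized gradient
    m = fp_quantize(beta1 * m + (1 - beta1) * g, bits)      # first moment
    v = fp_quantize(beta2 * v + (1 - beta2) * g**2, bits)   # second moment
    w = fp_quantize(w - lr * m / (np.sqrt(v) + eps), bits)  # quantized weights
    return w, m, v

# Toy check: noisy least squares, full precision vs. short mantissas.
rng = np.random.default_rng(0)
A, b = rng.normal(size=(200, 20)), rng.normal(size=200)
for bits in (52, 10, 4):  # 52 ~ float64; 10 matches fp16; 4 is aggressive
    w = np.zeros(20); m = np.zeros(20); v = np.zeros(20)
    for _ in range(2000):
        grad = A.T @ (A @ w - b) / len(b) + 0.01 * rng.normal(size=20)
        w, m, v = quantized_adam_step(w, m, v, grad, bits=bits)
    print(bits, "mantissa bits -> loss", 0.5 * np.mean((A @ w - b) ** 2))
```

The sketch also makes the abstract's sensitivity claim tangible: with $\beta_2 = 0.999$, the second-moment increment $(1-\beta_2)g^2$ is tiny relative to $v$, so a short mantissa can round it away entirely, which is why Adam's reliance on $\beta_2 \to 1$ makes second-moment quantization the delicate step.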
Similar Papers
Scaling Laws for Floating Point Quantization Training
Machine Learning (CS)
Makes AI models run faster and use less power.
SGD Convergence under Stepsize Shrinkage in Low-Precision Training
Machine Learning (CS)
Makes computers learn faster with less memory.