
Understanding the Generalization of Stochastic Gradient Adam in Learning Neural Networks

Published: October 13, 2025 | arXiv ID: 2510.11354v1

By: Xuan Tang, Han Zhang, Yuan Cao, and more

Potential Business Impact:

Shows how batch size and weight decay settings determine how well Adam-trained neural networks generalize, guiding more reliable model training.

Business Areas:
A/B Testing, Data and Analytics

Adam is a popular and widely used adaptive gradient method in deep learning that has also attracted substantial theoretical attention. However, most existing theoretical work primarily analyzes its full-batch version, which differs fundamentally from the stochastic variant used in practice. Unlike SGD, stochastic Adam does not converge to its full-batch counterpart even with infinitesimal learning rates. We present the first theoretical characterization of how batch size affects Adam's generalization, analyzing two-layer over-parameterized CNNs on image data. Our results reveal that while both full-batch Adam and AdamW with proper weight decay $\lambda$ converge to solutions with poor test error, their mini-batch variants can achieve near-zero test error. We further prove that Adam has a strictly smaller effective weight-decay bound than AdamW, theoretically explaining why Adam requires more sensitive tuning of $\lambda$. Extensive experiments validate our findings, demonstrating the critical role of batch size and weight decay in Adam's generalization performance.
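
To make the Adam-versus-AdamW distinction in the abstract concrete, here is a minimal sketch of the standard update rules showing where the weight decay $\lambda$ enters each optimizer. This is not the paper's construction or its two-layer CNN setting; the NumPy functions, parameter names, and default values below are illustrative assumptions.

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
              eps=1e-8, weight_decay=0.0):
    """One Adam step with coupled (L2-style) weight decay:
    lambda * w is folded into the gradient before the adaptive rescaling."""
    g = grad + weight_decay * w              # decay term passes through 1/sqrt(v)
    m = beta1 * m + (1 - beta1) * g          # first-moment estimate
    v = beta2 * v + (1 - beta2) * g**2       # second-moment estimate
    m_hat = m / (1 - beta1**t)               # bias correction
    v_hat = v / (1 - beta2**t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

def adamw_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.0):
    """One AdamW step with decoupled weight decay:
    lambda * w is applied directly to the weights, outside the adaptive term."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad**2
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps) - lr * weight_decay * w
    return w, m, v
```

In the Adam variant, the $\lambda w$ term is rescaled by the adaptive denominator $\sqrt{\hat v} + \epsilon$, so its effective strength depends on the gradient statistics, whereas in AdamW it acts directly on the weights. This mechanistic difference is the backdrop for the abstract's claim that Adam has a strictly smaller effective weight-decay bound than AdamW and therefore requires more sensitive tuning of $\lambda$.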

Country of Origin
🇭🇰 Hong Kong

Page Count
71 pages

Category
Computer Science:
Machine Learning (CS)