Understanding the Generalization of Stochastic Gradient Adam in Learning Neural Networks
By: Xuan Tang, Han Zhang, Yuan Cao, and more
Potential Business Impact:
Shows how batch size and weight decay choices affect how well Adam-trained models generalize.
Adam is a popular and widely used adaptive gradient method in deep learning that has also attracted substantial attention in theoretical research. However, most existing theoretical work analyzes its full-batch version, which differs fundamentally from the stochastic variant used in practice. Unlike SGD, stochastic Adam does not converge to its full-batch counterpart even with infinitesimal learning rates. We present the first theoretical characterization of how batch size affects Adam's generalization, analyzing two-layer over-parameterized CNNs on image data. Our results reveal that while full-batch Adam and AdamW with proper weight decay $\lambda$ converge to solutions with poor test error, their mini-batch variants can achieve near-zero test error. We further prove that Adam has a strictly smaller effective weight decay bound than AdamW, theoretically explaining why Adam requires more sensitive tuning of $\lambda$. Extensive experiments validate our findings, demonstrating the critical role of batch size and weight decay in Adam's generalization performance.
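To make the Adam-versus-AdamW distinction concrete, below is a minimal NumPy sketch of a single parameter update with weight decay $\lambda$. The function name `adam_step`, the hyperparameter defaults, and the `decoupled` flag are illustrative assumptions for this sketch, not the paper's implementation or analysis setup.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
              eps=1e-8, weight_decay=0.0, decoupled=False):
    """One Adam (decoupled=False) or AdamW (decoupled=True) update.

    Plain Adam folds the decay term weight_decay * theta into the gradient,
    so the decay is rescaled by the adaptive denominator sqrt(v_hat) + eps.
    AdamW applies the decay directly to the parameters instead.
    """
    if not decoupled:
        grad = grad + weight_decay * theta          # L2-style decay (Adam)

    m = beta1 * m + (1 - beta1) * grad              # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2         # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                    # bias correction (t starts at 1)
    v_hat = v / (1 - beta2 ** t)

    update = m_hat / (np.sqrt(v_hat) + eps)
    if decoupled:
        update = update + weight_decay * theta      # decoupled decay (AdamW)

    theta = theta - lr * update
    return theta, m, v
```

In the mini-batch regime the paper studies, `grad` would be computed on a random batch rather than the full training set. Coupling the decay to the gradient (plain Adam) rescales it by the adaptive denominator, which gives one intuition for why the effective weight decay applied by the two optimizers can differ.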
Similar Papers
Is your batch size the problem? Revisiting the Adam-SGD gap in language modeling
Machine Learning (CS)
Examines whether batch size explains the performance gap between Adam and SGD in language modeling.
Non-convergence to the optimal risk for Adam and stochastic gradient descent optimization in the training of deep neural networks
Machine Learning (CS)
Shows that Adam and stochastic gradient descent can fail to converge to the optimal risk when training deep neural networks.
Cumulative Learning Rate Adaptation: Revisiting Path-Based Schedules for SGD and Adam
Machine Learning (CS)
Revisits path-based learning rate schedules for SGD and Adam to adapt the learning rate during training.