Non-convergence to the optimal risk for Adam and stochastic gradient descent optimization in the training of deep neural networks
By: Thang Do, Arnulf Jentzen, Adrian Riekert
Potential Business Impact:
Explains why widely used AI training methods such as Adam and SGD can fail to reach the best possible accuracy, helping practitioners understand the limits of current training pipelines.
Despite the omnipresent use of stochastic gradient descent (SGD) optimization methods in the training of deep neural networks (DNNs), it remains, in basically all practically relevant scenarios, a fundamental open problem to provide a rigorous theoretical explanation for the success (and the limitations) of SGD optimization methods in deep learning. In particular, it remains an open question to prove or disprove convergence of the true risk of SGD optimization methods to the optimal true risk value in the training of DNNs. In one of the main results of this work we reveal for a general class of activations, loss functions, random initializations, and SGD optimization methods (including, for example, standard SGD, momentum SGD, Nesterov accelerated SGD, Adagrad, RMSprop, Adadelta, Adam, Adamax, Nadam, Nadamax, and AMSGrad) that in the training of any arbitrary fully-connected feedforward DNN it does not hold that the true risk of the considered optimizer converges in probability to the optimal true risk value. Nonetheless, the true risk of the considered SGD optimization method may very well converge to a strictly suboptimal true risk value.
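For readers who want the main claim in symbols, the following is a hedged formalization using notation that is not taken from the page itself: write \(\Theta_n\) for the iterates produced by the considered optimizer, \(\mathcal{L}\) for the true risk, and \(\mathcal{L}^* = \inf_{\theta} \mathcal{L}(\theta)\) for the optimal true risk value. Since \(\mathcal{L}(\Theta_n) \ge \mathcal{L}^*\), the statement that the true risk does not converge in probability to the optimal value amounts to the assertion
% minimal sketch of the non-convergence statement, assuming the notation above;
% the probability is over the random initialization and the random training data
\[
  \exists\, \varepsilon > 0 \colon \quad
  \limsup_{n \to \infty} \mathbb{P}\bigl( \mathcal{L}(\Theta_n) \ge \mathcal{L}^* + \varepsilon \bigr) > 0 ,
\]
while the abstract's final remark leaves open that \(\mathcal{L}(\Theta_n)\) converges to some value strictly larger than \(\mathcal{L}^*\).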
Similar Papers
Sharp higher order convergence rates for the Adam optimizer
Optimization and Control
Establishes faster (higher order) convergence rates for the Adam optimizer.
Understanding the Generalization of Stochastic Gradient Adam in Learning Neural Networks
Machine Learning (CS)
Studies how well neural networks trained with stochastic gradient Adam generalize to new data.
The Power of Random Features and the Limits of Distribution-Free Gradient Descent
Machine Learning (CS)
Shows what random-feature models can achieve and why gradient descent needs assumptions about the data distribution to succeed.