Non-convergence to the optimal risk for Adam and stochastic gradient descent optimization in the training of deep neural networks
By: Thang Do, Arnulf Jentzen, Adrian Riekert
Potential Business Impact:
Explains why widely used AI training methods such as Adam and SGD can fail to reach the best possible accuracy, helping practitioners understand the limits of current training pipelines.
Despite the omnipresent use of stochastic gradient descent (SGD) optimization methods in the training of deep neural networks (DNNs), it remains, in basically all practically relevant scenarios, a fundamental open problem to provide a rigorous theoretical explanation for the success (and the limitations) of SGD optimization methods in deep learning. In particular, it remains an open question to prove or disprove convergence of the true risk of SGD optimization methods to the optimal true risk value in the training of DNNs. In one of the main results of this work we reveal for a general class of activations, loss functions, random initializations, and SGD optimization methods (including, for example, standard SGD, momentum SGD, Nesterov accelerated SGD, Adagrad, RMSprop, Adadelta, Adam, Adamax, Nadam, Nadamax, and AMSGrad) that in the training of any arbitrary fully-connected feedforward DNN it does not hold that the true risk of the considered optimizer converges in probability to the optimal true risk value. Nonetheless, the true risk of the considered SGD optimization method may very well converge to a strictly suboptimal true risk value.
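For readers who want the main claim in symbols, the following is a hedged formalization using notation that is not taken from the page itself: write \(\Theta_n\) for the iterates produced by the considered optimizer, \(\mathcal{L}\) for the true risk, and \(\mathcal{L}^* = \inf_{\theta} \mathcal{L}(\theta)\) for the optimal true risk value. Since \(\mathcal{L}(\Theta_n) \ge \mathcal{L}^*\), the statement that the true risk does not converge in probability to the optimal value amounts to the assertion
% minimal sketch of the non-convergence statement, assuming the notation above;
% the probability is over the random initialization and the random training data
\[
  \exists\, \varepsilon > 0 \colon \quad
  \limsup_{n \to \infty} \mathbb{P}\bigl( \mathcal{L}(\Theta_n) \ge \mathcal{L}^* + \varepsilon \bigr) > 0 ,
\]
while the abstract's final remark leaves open that \(\mathcal{L}(\Theta_n)\) converges to some value strictly larger than \(\mathcal{L}^*\).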
Similar Papers
Sharp higher order convergence rates for the Adam optimizer
Optimization and Control
Establishes faster (higher order) convergence rates for the Adam optimizer.
Understanding the Generalization of Stochastic Gradient Adam in Learning Neural Networks
Machine Learning (CS)
Studies how well neural networks trained with stochastic gradient Adam generalize to new data.
The Power of Random Features and the Limits of Distribution-Free Gradient Descent
Machine Learning (CS)
Shows what random-feature models can achieve and why gradient descent needs assumptions about the data distribution to succeed.