A dynamic view of some anomalous phenomena in SGD

Published: May 3, 2025 | arXiv ID: 2505.01751v3

By: Vivek Shripad Borkar

Potential Business Impact:

Explains why a model's test error can rise before falling again during training, giving practitioners a principled basis for deciding how long to train and how much model capacity to use.

Business Areas:
A/B Testing, Data and Analytics

It has been observed by Belkin et al. that over-parametrized neural networks exhibit a 'double descent' phenomenon: as model complexity (as reflected in the number of features) increases, the test error first decreases, then increases, then decreases again. A counterpart of this phenomenon in the time domain has been noted in epoch-wise training, viz., the test error decreases with the number of iterates, then increases, then decreases again. Another anomalous phenomenon is 'grokking', wherein two regimes of descent are separated by a third regime in which the mean loss remains almost constant. This note presents a plausible explanation for these and related phenomena using the theory of two time scale stochastic approximation, applied to the continuous time limit of the gradient dynamics. This gives a novel perspective on an already well studied theme.
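
As a rough illustration of the framework invoked here (a minimal sketch, not the paper's model): in two time scale stochastic approximation, one iterate is updated with a larger step size than the other, so the fast variable effectively equilibrates for each nearly frozen value of the slow one. The drifts h and g and the step-size schedules below are illustrative assumptions, chosen only to make the separation of time scales visible.

    import numpy as np

    # Generic two time scale stochastic approximation (sketch).
    # x is the fast variable (step a_n), y the slow one (step b_n),
    # with b_n / a_n -> 0, so x tracks an equilibrium x*(y) while
    # y drifts slowly toward its own rest point.
    rng = np.random.default_rng(0)

    def h(x, y):
        # Fast drift (illustrative choice): x is pulled toward y.
        return -(x - y)

    def g(x, y):
        # Slow drift (illustrative choice): y relaxes toward 0
        # through the tracked fast variable.
        return -x

    x, y = 5.0, 3.0
    for n in range(1, 50_001):
        a_n = 1.0 / n ** 0.6   # fast step size
        b_n = 1.0 / n          # slow step size; b_n / a_n -> 0
        x += a_n * (h(x, y) + 0.1 * rng.standard_normal())
        y += b_n * (g(x, y) + 0.1 * rng.standard_normal())

    print(f"x = {x:.3f}, y = {y:.3f}")  # both approach 0: x tracks y, y -> 0

Running the sketch shows x rapidly locking onto y while y decays on the slower clock, which is the qualitative mechanism the note uses to account for plateaus and non-monotone test error.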

Page Count
8 pages

Category
Mathematics:
Optimization and Control