A dynamic view of some anomalous phenomena in SGD
By: Vivek Shripad Borkar
Potential Business Impact:
Explains why training a computer model for longer can suddenly make it better.
It has been observed by Belkin et al.\ that over-parametrized neural networks exhibit a `double descent' phenomenon: as the model complexity (as reflected in the number of features) increases, the test error initially decreases, then increases, and then decreases again. A counterpart of this phenomenon in the time domain has been noted in the context of epoch-wise training, viz., the test error decreases with the number of iterates, then increases, then decreases again. Another anomalous phenomenon is \textit{grokking}, wherein two regimes of descent are separated by an intermediate regime in which the mean loss remains almost constant. This note presents a plausible explanation for these and related phenomena using the theory of two-time-scale stochastic approximation, applied to the continuous-time limit of the gradient dynamics. This gives a novel perspective on an already well-studied theme.
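For reference, a minimal sketch of the generic two-time-scale stochastic approximation scheme invoked above (the drift functions $h$, $g$, step sizes $a(n)$, $b(n)$, and martingale noise terms $M^{(1)}_{n+1}$, $M^{(2)}_{n+1}$ are generic placeholders, not the specific coupling analyzed in the paper):
\begin{align*}
  x_{n+1} &= x_n + a(n)\bigl[h(x_n, y_n) + M^{(1)}_{n+1}\bigr],\\
  y_{n+1} &= y_n + b(n)\bigl[g(x_n, y_n) + M^{(2)}_{n+1}\bigr],
\end{align*}
with $b(n)/a(n) \to 0$, so that $\{x_n\}$ evolves on the fast time scale while $\{y_n\}$ evolves on the slow one and sees the fast iterate as quasi-equilibrated; the corresponding continuous-time limit is the singularly perturbed ODE pair $\dot{x}(t) = \varepsilon^{-1} h(x(t), y(t))$, $\dot{y}(t) = g(x(t), y(t))$ with $0 < \varepsilon \ll 1$.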
Similar Papers
The Double Descent Behavior in Two Layer Neural Network for Binary Classification
Machine Learning (Stat)
Finds a sweet spot for computer learning accuracy.
A Two-Phase Perspective on Deep Learning Dynamics
High Energy Physics - Theory
Helps computers learn better by forgetting some things.
Emergence and scaling laws in SGD learning of shallow neural networks
Machine Learning (CS)
Teaches computers to learn complex patterns faster.