A Teacher-Student Perspective on the Dynamics of Learning Near the Optimal Point
By: Carlos Couto, José Mourão, Mário A. T. Figueiredo, and more
Near an optimal learning point of a neural network, the learning performance of gradient descent dynamics is dictated by the Hessian matrix of the loss function with respect to the network parameters. We characterize the Hessian eigenspectrum for several classes of teacher-student problems in which the teacher and student networks have matching weights, showing that the smallest eigenvalues of the Hessian govern long-time learning performance. For linear networks, we establish analytically that, in the large-network limit, the spectrum asymptotically follows a convolution of a scaled chi-square distribution with a scaled Marchenko-Pastur distribution. We numerically analyse the Hessian spectrum for polynomial and other non-linear networks. Furthermore, we show that the rank of the Hessian matrix can be interpreted as an effective number of parameters for networks with polynomial activation functions. For a generic non-linear activation function, such as the error function, we empirically observe that the Hessian matrix always has full rank.
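The setup described above can be probed numerically. The sketch below (not the authors' code) builds a small two-layer linear teacher-student problem, initializes the student at the teacher's weights so the loss sits at an optimum, and computes the eigenspectrum and numerical rank of the Hessian of an MSE loss. The layer sizes, sample count, Gaussian inputs, and rank threshold are illustrative assumptions.

```python
# Minimal sketch, assuming a two-layer linear network with MSE loss and
# Gaussian inputs; the student is placed exactly at the teacher's weights.
import torch

torch.manual_seed(0)
d_in, d_hidden, n_samples = 20, 10, 500

# Teacher weights (the student will match them, so the loss is zero).
W1_t = torch.randn(d_hidden, d_in) / d_in**0.5
w2_t = torch.randn(d_hidden) / d_hidden**0.5

X = torch.randn(n_samples, d_in)       # i.i.d. Gaussian inputs
y = X @ W1_t.T @ w2_t                  # teacher outputs (linear network)

def loss(theta):
    # Unpack a flat parameter vector into the student's two linear layers.
    W1 = theta[: d_hidden * d_in].reshape(d_hidden, d_in)
    w2 = theta[d_hidden * d_in:]
    pred = X @ W1.T @ w2
    return 0.5 * ((pred - y) ** 2).mean()

# Student parameters at the matching-weights optimum.
theta_star = torch.cat([W1_t.reshape(-1), w2_t])

# Hessian of the loss with respect to all student parameters.
H = torch.autograd.functional.hessian(loss, theta_star)
eigvals = torch.linalg.eigvalsh(H)     # ascending eigenvalues

print("smallest eigenvalues:", eigvals[:5])
print("largest eigenvalues:", eigvals[-5:])
print("numerical rank:", int((eigvals > 1e-8 * eigvals.max()).sum()))
```

With a linear student, the numerical rank typically falls well below the total parameter count, which is the sense in which the rank can act as an effective number of parameters; swapping in a generic non-linear activation (e.g. `torch.erf`) in `loss` would let one check the full-rank observation empirically.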