Learning quadratic neural networks in high dimensions: SGD dynamics and scaling laws
By: Gérard Ben Arous, Murat A. Erdogdu, N. Mert Vural, and more
Potential Business Impact:
Teaches computers to learn patterns faster, using less data.
We study the optimization and sample complexity of gradient-based training of a two-layer neural network with quadratic activation function in the high-dimensional regime, where the data is generated as $y \propto \sum_{j=1}^{r}\lambda_j \sigma\left(\langle \boldsymbol{\theta}_j, \boldsymbol{x}\rangle\right)$ with $\boldsymbol{x} \sim N(0,\boldsymbol{I}_d)$, $\sigma$ is the second Hermite polynomial, and $\lbrace\boldsymbol{\theta}_j \rbrace_{j=1}^{r} \subset \mathbb{R}^d$ are orthonormal signal directions. We consider the extensive-width regime $r \asymp d^\beta$ for $\beta \in [0, 1)$, and assume a power-law decay on the (non-negative) second-layer coefficients, $\lambda_j\asymp j^{-\alpha}$ for $\alpha \geq 0$. We present a sharp analysis of the SGD dynamics in the feature learning regime, for both the population limit and the finite-sample (online) discretization, and derive scaling laws for the prediction risk that highlight its power-law dependence on the optimization time, sample size, and model width. Our analysis combines a precise characterization of the associated matrix Riccati differential equation with novel matrix monotonicity arguments to establish convergence guarantees for the infinite-dimensional effective dynamics.
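To make the setup concrete, here is a minimal, self-contained sketch (not the authors' code) of the data-generating model and the one-pass online SGD it studies. The paper's regime has $r \asymp d^\beta$ teacher directions with power-law weights $\lambda_j \asymp j^{-\alpha}$, and $\sigma$ is the second probabilist's Hermite polynomial, $\mathrm{He}_2(z) = z^2 - 1$. All concrete choices below (the values of `d`, `alpha`, `beta`, the student width `m`, the fixed uniform second layer, the step size, and the risk normalization) are illustrative assumptions, not taken from the paper.

```python
import numpy as np

# ----- Illustrative parameter choices (assumptions, not from the paper) -----
d, beta, alpha = 256, 0.5, 1.0        # ambient dimension, width/decay exponents
r = int(d ** beta)                    # number of teacher directions, r ~ d^beta
m = 2 * r                             # student width (assumed mild overparametrization)
lr, n_steps = 1e-2 / d, 20_000        # online SGD step size and sample budget

rng = np.random.default_rng(0)

def he2(z):
    """Second (probabilist's) Hermite polynomial: He_2(z) = z^2 - 1."""
    return z ** 2 - 1.0

# Teacher: orthonormal directions via QR, power-law second layer lambda_j ~ j^{-alpha}
theta, _ = np.linalg.qr(rng.standard_normal((d, r)))  # columns are orthonormal
lam = np.arange(1, r + 1, dtype=float) ** (-alpha)
lam /= np.linalg.norm(lam)                            # normalize so E[y^2] is O(1)

def teacher(x):
    return lam @ he2(theta.T @ x)

# Student: two-layer quadratic network; train the first layer,
# keep the second layer fixed (a simplifying assumption for this sketch)
W = rng.standard_normal((m, d)) / np.sqrt(d)
a = np.full(m, 1.0 / m)

def student(x, W):
    return a @ he2(W @ x)

# Online (one-pass) SGD on the squared loss: one fresh Gaussian sample per step
for t in range(n_steps):
    x = rng.standard_normal(d)
    err = student(x, W) - teacher(x)
    z = W @ x
    # d/dw_k [ a_k * (<w_k, x>^2 - 1) ] = 2 * a_k * <w_k, x> * x
    W -= lr * err * 2.0 * (a * z)[:, None] * x[None, :]

# Monte Carlo estimate of the prediction risk E[(f_student - f_teacher)^2] / 2
xs = rng.standard_normal((4096, d))
risk = 0.5 * np.mean((he2(xs @ W.T) @ a - he2(xs @ theta) @ lam) ** 2)
print(f"d={d}, r={r}, estimated risk after {n_steps} online steps: {risk:.4f}")
```

Logging the estimated risk at increasing step counts (and repeating across several values of `d`, `alpha`, and `beta`) is the natural way to visualize the power-law scaling in time, samples, and width that the paper derives analytically.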
Similar Papers
Emergence and scaling laws in SGD learning of shallow neural networks
Machine Learning (CS)
Teaches computers to learn complex patterns faster.
Neural Networks Learn Generic Multi-Index Models Near Information-Theoretic Limit
Machine Learning (Stat)
Teaches computers to learn hidden patterns faster.
Universality of high-dimensional scaling limits of stochastic gradient descent
Machine Learning (Stat)
Makes computer learning work even with messy data.