Dimension-Free Minimax Rates for Learning Pairwise Interactions in Attention-Style Models
By: Shai Zucker, Xiong Wang, Fei Lu, and more
Potential Business Impact:
Shows AI attention models can learn interactions equally fast, no matter how high-dimensional the data.
We study the convergence rate of learning pairwise interactions in single-layer attention-style models, where tokens interact through a weight matrix and a nonlinear activation function. We prove that the minimax rate is $M^{-\frac{2\beta}{2\beta+1}}$, with $M$ being the sample size; the rate depends only on the smoothness $\beta$ of the activation function and, crucially, is independent of the token count, the ambient dimension, and the rank of the weight matrix. These results highlight a fundamental dimension-free statistical efficiency of attention-style nonlocal models, even when the weight matrix and activation are not separately identifiable, and they provide a theoretical understanding of the attention mechanism and its training.
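To make the setting concrete, below is a minimal sketch of a single-layer attention-style pairwise interaction model in Python/NumPy. The exact functional form is an assumption for illustration: here each token aggregates the other tokens weighted by $\sigma(x_i^\top A x_j)$, where $A$ is the weight matrix and $\sigma$ the activation. The paper's precise model and normalization may differ; note also that, per the abstract, only the composition of $A$ and $\sigma$ need be identifiable, so an estimator would learn them jointly.

```python
import numpy as np

def attention_style_interaction(X, A, sigma):
    """Illustrative single-layer attention-style pairwise interaction.

    X     : (n, d) array of n tokens in R^d
    A     : (d, d) weight matrix
    sigma : elementwise nonlinear activation

    Token i aggregates pairwise interactions as
        f(X)_i = (1/n) * sum_j sigma(x_i^T A x_j) * x_j
    (assumed form; the paper's model may differ in details).
    """
    scores = sigma(X @ A @ X.T)       # (n, n) pairwise interaction scores
    return scores @ X / X.shape[0]    # (n, d) aggregated token outputs

# Usage example with a smooth (tanh) activation on random tokens.
rng = np.random.default_rng(0)
n, d = 8, 4
X = rng.standard_normal((n, d))
A = rng.standard_normal((d, d))
out = attention_style_interaction(X, A, np.tanh)
print(out.shape)  # (8, 4)
```

In this picture, the learning problem is to recover the map $(x_i, x_j) \mapsto \sigma(x_i^\top A x_j)$ from $M$ observed input-output pairs, and the paper's result says the attainable error rate $M^{-\frac{2\beta}{2\beta+1}}$ is governed only by the smoothness $\beta$ of $\sigma$, not by $n$, $d$, or the rank of $A$.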
Similar Papers
Minimax Rates for the Estimation of Eigenpairs of Weighted Laplace-Beltrami Operators on Manifolds
Machine Learning (Stat)
Helps computers find hidden patterns in data.
Adversarial learning for nonparametric regression: Minimax rate and adaptive estimation
Machine Learning (Stat)
Protects computers from tricky, fake data.
Inductive Bias and Spectral Properties of Single-Head Attention in High Dimensions
Machine Learning (Stat)
Helps AI learn better by understanding how it works.