Universal Approximation with Softmax Attention
By: Jerry Yao-Chieh Hu, Hude Liu, Hong-Yu Chen, and more
Potential Business Impact:
Shows that attention layers alone, without feed-forward networks, can model any continuous sequence-to-sequence task, which could enable simpler Transformer designs.
We prove that with linear transformations, both (i) two-layer self-attention and (ii) one-layer self-attention followed by a softmax function are universal approximators for continuous sequence-to-sequence functions on compact domains. Our main technique is a new interpolation-based method for analyzing attention's internal mechanism. This leads to our key insight: self-attention is able to approximate a generalized version of ReLU to arbitrary precision, and hence subsumes many known universal approximators. Building on these results, we show that two-layer multi-head attention alone suffices as a sequence-to-sequence universal approximator. In contrast, prior works rely on feed-forward networks to establish universal approximation in Transformers. Furthermore, we extend our techniques to show that (softmax-)attention-only layers can approximate various statistical models in-context. We believe these techniques hold independent interest.
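To get a feel for the ReLU claim, here is a minimal NumPy sketch (an illustration under simple assumptions, not the paper's actual construction): a single softmax-attention head that scores two candidates, 0 and x, with a scaling factor beta outputs x * sigmoid(beta * x), which converges to max(0, x) as beta grows. The names attention_relu and beta are hypothetical and chosen for this example.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over a 1-D array of scores.
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def attention_relu(x, beta=50.0):
    # One softmax-attention "head" with scores beta*[0, x] and values [0, x].
    # Its output equals x * sigmoid(beta * x), which approaches max(0, x)
    # as beta grows, i.e. softmax attention can mimic ReLU.
    scores = beta * np.array([0.0, x])
    values = np.array([0.0, x])
    return softmax(scores) @ values

xs = np.linspace(-2.0, 2.0, 401)
approx = np.array([attention_relu(x) for x in xs])
exact = np.maximum(xs, 0.0)
print(f"max |error| = {np.max(np.abs(approx - exact)):.2e}")  # shrinks as beta increases
```

The worst-case gap scales roughly like 1/beta, so sharpening the attention scores drives the approximation error to zero; the paper's interpolation argument generalizes this one-dimensional picture to sequence-to-sequence functions.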
Similar Papers
Attention Mechanism, Max-Affine Partition, and Universal Approximation
Machine Learning (CS)
Lets computers learn any pattern from data.
Softmax as Linear Attention in the Large-Prompt Regime: a Measure-based Perspective
Machine Learning (CS)
Makes AI learn better with longer instructions.
Transformers Can Overcome the Curse of Dimensionality: A Theoretical Study from an Approximation Perspective
Machine Learning (CS)
Makes AI understand complex patterns better and faster.