Attention Mechanism, Max-Affine Partition, and Universal Approximation
By: Hude Liu, Jerry Yao-Chieh Hu, Zhao Song, and others
Potential Business Impact:
Shows that even a very simple attention layer can, in principle, learn any continuous pattern from data.
We establish the universal approximation capability of single-layer, single-head self- and cross-attention mechanisms with minimal attached structures. Our key insight is to interpret single-head attention as a mechanism that partitions the input domain and assigns distinct values to subregions. This allows us to engineer the attention weights so that this assignment imitates the target function. Building on this, we prove that a single self-attention layer, preceded by sum-of-linear transformations, can approximate any continuous function on a compact domain under the $L_\infty$-norm. We further extend this construction to approximate any Lebesgue integrable function under the $L_p$-norm for $1\leq p <\infty$. Lastly, we extend our techniques to show, for the first time, that single-head cross-attention achieves the same universal approximation guarantees.
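To make the partition idea concrete, here is a minimal numerical sketch (not the paper's exact construction, and all names such as `f`, `K_bins`, and `beta` are illustrative assumptions): keys encode affine functions whose argmax induces a max-affine partition of $[0,1]$, values store the target function at the cell centers, and a softmax with large inverse temperature acts approximately as the argmax selector, so a single attention read-out approximates the target.

```python
import numpy as np

# Illustrative sketch only: approximate a continuous 1-D target f on [0, 1]
# with a single attention-style read-out. Keys encode affine functions whose
# argmax partitions the domain into cells (a max-affine partition); values
# store f at the cell centers; a large inverse temperature beta makes the
# softmax behave like the argmax selector over cells.

def f(x):                      # target function to approximate (assumed example)
    return np.sin(2 * np.pi * x)

K_bins = 64                    # number of partition cells
beta = 200.0                   # inverse temperature (sharpness of the softmax)

centers = (np.arange(K_bins) + 0.5) / K_bins      # cell centers c_k
keys = np.stack([centers, -0.5 * centers**2], 1)  # affine params (a_k, b_k)
values = f(centers)                               # one stored value per cell

def attention_readout(x):
    q = np.array([x, 1.0])            # linear lift of the scalar input
    scores = keys @ q                 # a_k * x + b_k, maximized at the nearest c_k
    w = np.exp(beta * (scores - scores.max()))
    w /= w.sum()                      # softmax over the partition cells
    return w @ values                 # weighted value ~ f(nearest center)

xs = np.linspace(0.0, 1.0, 500)
approx = np.array([attention_readout(x) for x in xs])
print("max |f - attention readout|:", np.abs(f(xs) - approx).max())
```

Increasing `K_bins` shrinks the cells, and increasing `beta` sharpens the softmax toward a hard selection, which together drive the sup-norm error down, mirroring the role of the partition refinement in the approximation argument.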
Similar Papers
Universal Approximation with Softmax Attention
Machine Learning (CS)
Shows that softmax attention on its own can approximate a broad class of functions.
Hierarchical Self-Attention: Generalizing Neural Attention Mechanics to Multi-Scale Problems
Machine Learning (CS)
Extends attention so models can reason over information at multiple scales.
The Effect of Attention Head Count on Transformer Approximation
Machine Learning (CS)
More "attention heads" make AI understand better.