Attention-Only Transformers via Unrolled Subspace Denoising
By: Peng Wang, Yifu Lu, Yaodong Yu, and more
Potential Business Impact:
Makes AI models simpler and easier to understand by using fewer parts, without losing performance.
Despite the popularity of transformers in practice, their architectures are empirically designed and neither mathematically justified nor interpretable. Moreover, as indicated by many empirical studies, some components of transformer architectures may be redundant. To derive a fully interpretable transformer architecture with only necessary components, we contend that the goal of representation learning is to compress a set of noisy initial token representations towards a mixture of low-dimensional subspaces. To compress these noisy token representations, an associated denoising operation naturally takes the form of a multi-head (subspace) self-attention. By unrolling such iterative denoising operations into a deep network, we arrive at a highly compact architecture that consists of only self-attention operators with skip connections at each layer. Moreover, we show that each layer performs highly efficient denoising: it improves the signal-to-noise ratio of token representations at a linear rate with respect to the number of layers. Despite its simplicity, extensive experiments on vision and language tasks demonstrate that such a transformer achieves performance close to that of standard transformer architectures such as GPT-2 and CRATE.
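To make the unrolled denoising concrete, below is a minimal NumPy sketch of one such layer: a multi-head (subspace) self-attention step with a skip connection, stacked for several iterations. The orthonormal bases, the softmax similarity form, and the step size are illustrative assumptions for this sketch, not the paper's exact update rule or trained parameters.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def subspace_attention_layer(Z, subspaces, step_size=0.5):
    """One unrolled denoising step: multi-head (subspace) self-attention
    plus a skip connection. Z has shape (d, n): n token representations of
    dimension d. `subspaces` is a list of (d, p) orthonormal bases, one per
    head; these bases and `step_size` are illustrative assumptions."""
    update = np.zeros_like(Z)
    for U in subspaces:
        # Project tokens onto this head's low-dimensional subspace.
        P = U.T @ Z                                           # (p, n)
        # Similarity within the subspace drives the denoising: each token
        # is pulled toward the subspace components of similar tokens.
        A = softmax(P.T @ P / np.sqrt(U.shape[1]), axis=-1)   # (n, n)
        update += U @ (P @ A.T)        # lift averaged components back to R^d
    # Skip connection: the layer refines, rather than replaces, the tokens.
    return Z + step_size * update

def attention_only_transformer(Z0, subspaces, num_layers=8):
    """Unroll the denoising iteration into a deep, attention-only network."""
    Z = Z0
    for _ in range(num_layers):
        Z = subspace_attention_layer(Z, subspaces)
    return Z

# Toy usage: tokens drawn near a union of two 2-D subspaces in R^16, plus noise.
rng = np.random.default_rng(0)
d, p, n = 16, 2, 32
bases = [np.linalg.qr(rng.standard_normal((d, p)))[0] for _ in range(2)]
clean = np.hstack([U @ rng.standard_normal((p, n // 2)) for U in bases])
Z0 = clean + 0.3 * rng.standard_normal((d, n))
Z_out = attention_only_transformer(Z0, bases)
```

In this toy setup, each pass pulls the noisy tokens back toward the union of subspaces they were drawn from, which is the sense in which stacking such layers is meant to raise the signal-to-noise ratio of the token representations.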
Similar Papers
Revisiting Transformers with Insights from Image Filtering
CV and Pattern Recognition
Makes AI understand pictures and words better.
Integral Transformer: Denoising Attention, Not Too Much Not Too Little
Computation and Language
Cleans up computer language understanding for better results.
Attention Layers Add Into Low-Dimensional Residual Subspaces
Machine Learning (CS)
Makes AI understand things better by fixing its "dead features."