Integral Transformer: Denoising Attention, Not Too Much Not Too Little
By: Ivan Kobyzev, Abbas Ghaddar, Dingtao Hu, and more
Potential Business Impact:
Cleans up noisy attention in language models for better results.
Softmax self-attention often assigns disproportionate weight to semantically uninformative tokens such as special tokens and punctuation, a phenomenon known as attention noise. While recent methods like Cog Attention and the Differential Transformer have addressed this by introducing negative attention scores, they risk discarding useful information. In this paper, we propose the Integral Transformer, a novel self-attention mechanism that denoises attention by integrating signals sampled from the logit distribution. Our approach mitigates noise while preserving the contributions of special tokens critical for model performance. Extensive experiments demonstrate that our model outperforms vanilla, Cog, and Differential attention variants on well-established knowledge and reasoning language benchmarks. Moreover, our analysis reveals that employing vanilla self-attention in the lower Transformer layers enhances performance and that the Integral Transformer effectively balances attention distributions and reduces rank collapse in upper layers.
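The abstract describes the mechanism only at a high level, so the following is a minimal sketch of one plausible reading, not the paper's actual method: draw several noisy samples of the attention logits and average the resulting softmax maps, a Monte Carlo approximation of an integral over the logit distribution. All names here (integral_attention, num_samples, noise_scale) and the Gaussian noise model are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def integral_attention(q, k, v, num_samples=8, noise_scale=0.1):
    """Hypothetical sketch, not the paper's implementation: average
    softmax attention over noisy samples of the logits, approximating
    an integral over the logit distribution.

    q, k, v: tensors of shape (batch, heads, seq_len, head_dim)
    """
    d = q.size(-1)
    # Scaled dot-product logits, as in vanilla attention: (B, H, T, T)
    logits = q @ k.transpose(-2, -1) / d**0.5

    # Monte Carlo estimate of E[softmax(logits + noise)]: each sample
    # perturbs the logits, and averaging the resulting attention maps
    # damps spuriously large weights on single tokens.
    attn = torch.zeros_like(logits)
    for _ in range(num_samples):
        noisy = logits + noise_scale * torch.randn_like(logits)
        attn = attn + F.softmax(noisy, dim=-1)
    attn = attn / num_samples

    return attn @ v  # (B, H, T, head_dim)
```

Note that an average of softmax maps remains nonnegative and sums to one over keys, which is consistent with the abstract's contrast to Cog Attention and the Differential Transformer: those introduce negative attention scores and risk discarding useful signal, whereas integrating over the logit distribution smooths attention without zeroing out special tokens.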
Similar Papers
Attention-Only Transformers via Unrolled Subspace Denoising
Machine Learning (CS)
Makes AI understand things better with fewer parts.
CTLformer: A Hybrid Denoising Model Combining Convolutional Layers and Self-Attention for Enhanced CT Image Reconstruction
Image and Video Processing
Cleans up blurry medical scans for better health checks.
Deconstructing Attention: Investigating Design Principles for Effective Language Modeling
Computation and Language
Makes computer language models work better and simpler.