Revisiting Transformers with Insights from Image Filtering
By: Laziz U. Abdullaev, Maksim Tkachenko, Tan M. Nguyen
Potential Business Impact:
Helps AI models understand images and text more accurately and robustly.
The self-attention mechanism, a cornerstone of Transformer-based state-of-the-art deep learning architectures, is largely heuristic-driven and fundamentally challenging to interpret. Establishing a robust theoretical foundation to explain its remarkable success and limitations has therefore become an increasingly prominent focus in recent research. Some notable directions have explored understanding self-attention through the lens of image denoising and nonparametric regression. While promising, existing frameworks still lack a deeper mechanistic interpretation of various architectural components that enhance self-attention, both in its original formulation and in subsequent variants. In this work, we aim to advance this understanding by developing a unifying image processing framework, capable of explaining not only the self-attention computation itself but also the role of components such as positional encoding and residual connections, including numerous later variants. Building upon our framework, we also pinpoint potential distinctions between self-attention and classical image filtering, and make an effort to close this gap by introducing two independent architectural modifications within Transformers. While our primary objective is interpretability, we empirically observe that these image-processing-inspired modifications can also lead to notably improved accuracy and robustness against data contamination and adversaries across language and vision tasks, as well as better long-sequence understanding.
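The abstract leans on the analogy between softmax self-attention and similarity-weighted image filters such as non-local means denoising. The paper's specific framework and the two proposed modifications are not detailed here, so the sketch below only illustrates that commonly drawn correspondence: both operations replace each token (or image patch) by a similarity-weighted average of all the others. The function names, shapes, and NumPy implementation are illustrative assumptions, not the authors' code.

```python
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable softmax along the given axis.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention_filter(X, W_q, W_k, W_v):
    """Standard softmax self-attention over a sequence X of shape (n, d)."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    A = softmax(Q @ K.T / np.sqrt(K.shape[-1]))   # similarity-based weights
    return A @ V                                  # weighted average of value vectors

def nonlocal_means_denoise(patches, h=1.0):
    """Non-local means: each patch is replaced by a similarity-weighted
    average of all patches, mirroring the attention computation above."""
    d2 = ((patches[:, None, :] - patches[None, :, :]) ** 2).sum(-1)
    W = softmax(-d2 / (h ** 2))                   # Gaussian similarity kernel
    return W @ patches

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(8, 4))                   # 8 tokens / patches, 4 features each
    W = rng.normal(size=(4, 4))
    print(self_attention_filter(X, W, W, W).shape)  # (8, 4)
    print(nonlocal_means_denoise(X, h=2.0).shape)   # (8, 4)
```

In both routines the output is a row-stochastic weight matrix applied to the inputs; the main structural difference is that attention learns the similarity metric through the query and key projections, whereas the denoising filter fixes it as a Gaussian kernel on patch distances.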
Similar Papers
Attention-Only Transformers via Unrolled Subspace Denoising
Machine Learning (CS)
Makes AI understand things better with fewer parts.
Mechanistic Interpretability of Fine-Tuned Vision Transformers on Distorted Images: Decoding Attention Head Behavior for Transparent and Trustworthy AI
Machine Learning (CS)
Helps AI understand what's important in pictures.
Can We Achieve Efficient Diffusion without Self-Attention? Distilling Self-Attention into Convolutions
CV and Pattern Recognition
Makes AI art creation much faster and cheaper.