Attention as an Adaptive Filter
By: Peter Racioppo
Potential Business Impact:
Could make AI systems more accurate and robust when tracking patterns in noisy, time-varying data.
We introduce Adaptive Filter Attention (AFA), a novel attention mechanism that incorporates a learnable dynamics model directly into the computation of attention weights. Rather than comparing queries and keys directly, we model the input sequence as discrete observations of a linear stochastic differential equation (SDE). By imposing a linear dynamics model with simultaneously diagonalizable state matrices and noise covariances, we can make use of a closed-form solution to the differential Lyapunov equation to efficiently propagate pairwise uncertainties through the dynamics. Attention naturally arises as the maximum likelihood solution for this linear SDE, with attention weights corresponding to robust residual-based reweightings of the propagated pairwise precisions. Imposing an additional constraint on the state matrix's eigenvalues leads to a simplified variant with the same computational and memory complexity as standard attention. In the limit of vanishing dynamics and process noise, and using a small-angle approximation, we recover ordinary dot-product attention.
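To make the abstract's construction concrete, below is a minimal NumPy sketch of the general idea, under assumed simplifications: a single head, real-valued diagonal dynamics (eigenvalues `lam` with negative real part), scalar per-eigenmode process and measurement noise `q` and `r`, and a Cauchy-style robust reweighting of the propagated precisions. The function name, parameter names, and the specific robust weight are illustrative choices, not the paper's exact formulation.

```python
import numpy as np

def adaptive_filter_attention(Q, K, V, t, lam, q=0.1, r=0.1):
    """
    Q, K, V : (T, d) query/key/value sequences (one head).
    t       : (T,) observation times.
    lam     : (d,) strictly negative decay rates of the assumed diagonal state matrix.
    q, r    : process / measurement noise scales (assumed scalars here).
    """
    T, d = Q.shape
    dt = np.abs(t[:, None] - t[None, :])                  # (T, T) elapsed times

    # Propagate each key/value to the query's time step with the
    # diagonal state-transition matrix exp(lam * dt).
    Phi = np.exp(lam[None, None, :] * dt[:, :, None])     # (T, T, d)
    K_prop = Phi * K[None, :, :]                           # propagated keys
    V_prop = Phi * V[None, :, :]                           # propagated values

    # Closed-form solution of the (diagonal) differential Lyapunov equation:
    # propagated measurement noise plus process noise accumulated over dt.
    var = Phi**2 * r + q * (Phi**2 - 1.0) / (2.0 * lam[None, None, :])
    prec = 1.0 / (var + 1e-8)                              # pairwise precisions

    # Robust residual-based reweighting of the propagated precisions
    # (a Cauchy-style weight; the paper's exact reweighting may differ).
    resid = np.sum(prec * (Q[:, None, :] - K_prop)**2, axis=-1)   # (T, T)
    w = np.mean(prec, axis=-1) / (1.0 + resid)                     # (T, T)

    A = w / np.sum(w, axis=-1, keepdims=True)              # attention weights
    return np.einsum('ij,ijd->id', A, V_prop)              # (T, d) filtered output


# Toy usage on random data (shapes and values are arbitrary).
T, d = 16, 8
rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, T)
Q = rng.standard_normal((T, d))
K = rng.standard_normal((T, d))
V = rng.standard_normal((T, d))
lam = -np.abs(rng.standard_normal(d))                      # strictly negative eigenvalues
out = adaptive_filter_attention(Q, K, V, t, lam)
```

In this sketch, letting `lam` and `q` shrink toward zero drives `Phi` toward 1 and makes the precisions uniform, so the weights depend only on the query-key residuals, echoing (though not exactly reproducing) the dot-product limit described in the abstract.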