GatedFWA: Linear Flash Windowed Attention with Gated Associative Memory
By: Jiaxu Liu, Yuhe Bai, Christos-Savvas Bouganis
Potential Business Impact:
Makes AI models learn faster and remember more.
Modern autoregressive models rely on attention, yet the Softmax full attention in Transformers scales quadratically with sequence length. Sliding Window Attention (SWA) achieves linear-time encoding/decoding by constraining the attention pattern, but under an Associative Memory interpretation, its difference-style update renders the training objective effectively unbounded. In contrast, Softmax attention normalizes updates, leading to memory shrinkage and gradient vanishing. We propose GatedFWA: a Memory-Gated (Flash) Windowed Attention mechanism that preserves SWA's efficiency while stabilizing memory updates and keeping gradient flow controllable. In essence, GatedFWA accumulates a per-token/head gate into a decay bias added to the attention logits, acting as a learnable contraction in the memory recurrence. We implement a fused one-pass gate preprocessing step and a FlashAttention-compatible kernel that injects the gate under a sliding mask, ensuring I/O efficiency and numerical stability. On language modelling benchmarks, GatedFWA delivers competitive throughput with negligible overhead and better use of global context; it also integrates cleanly with token compression/selection methods such as NSA and generalizes to other autoregressive domains.
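To make the mechanism concrete, below is a minimal, naive PyTorch sketch of sliding-window attention with a cumulative per-token/head gate folded into the logits as a decay bias. It is not the authors' fused kernel; the function name, the log-sigmoid gate parameterization, and the bias form B[t, s] = G_t - G_s are assumptions inferred from the abstract.

```python
# Hedged sketch of GatedFWA-style attention (reference semantics only, not the
# I/O-efficient FlashAttention-compatible kernel described in the paper).
import torch
import torch.nn.functional as F

def gated_windowed_attention(q, k, v, gate_logits, window):
    """
    q, k, v:      (batch, heads, seq, dim)
    gate_logits:  (batch, heads, seq) raw per-token/head gate scores (assumed)
    window:       sliding-window size (keys each query may attend to)
    """
    b, h, n, d = q.shape
    scale = d ** -0.5

    # Per-token log-gate in (-inf, 0]; its cumulative sum plays the role of the
    # accumulated gate G_t, acting as a learnable contraction on the memory.
    log_g = F.logsigmoid(gate_logits)            # (b, h, n)
    G = torch.cumsum(log_g, dim=-1)              # (b, h, n)

    # Decay bias added to the attention logits: older keys are discounted by
    # the gates accumulated since they were written, B[t, s] = G_t - G_s.
    bias = G[..., :, None] - G[..., None, :]     # (b, h, n, n)

    logits = torch.einsum("bhtd,bhsd->bhts", q, k) * scale + bias

    # Causal sliding-window mask: query t sees keys s with t - window < s <= t.
    idx = torch.arange(n, device=q.device)
    masked = (idx[:, None] < idx[None, :]) | (idx[:, None] - idx[None, :] >= window)
    logits = logits.masked_fill(masked, float("-inf"))

    attn = torch.softmax(logits, dim=-1)
    return torch.einsum("bhts,bhsd->bhtd", attn, v)
```

In this toy form the bias is materialized as an n-by-n tensor; the paper's contribution is computing the same quantity with a fused one-pass gate preprocessing and injecting it tile-by-tile inside a FlashAttention-style sliding-window kernel, so memory stays linear in sequence length.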
Similar Papers
Gating is Weighting: Understanding Gated Linear Attention through In-context Learning
Machine Learning (CS)
Lets computers learn better by choosing important words.
ENA: Efficient N-dimensional Attention
Machine Learning (CS)
Helps computers understand complex, long data faster.
Gated Associative Memory: A Parallel O(N) Architecture for Efficient Sequence Modeling
Computation and Language
Lets computers understand long stories faster.