SAGA: Selective Adaptive Gating for Efficient and Expressive Linear Attention
By: Yuan Cao, Dong Wang
Potential Business Impact:
Makes computers see more clearly, faster, and with less memory.
While the Transformer architecture excels at modeling long-range dependencies, contributing to its widespread adoption in vision tasks, the quadratic complexity of softmax-based attention mechanisms imposes a major bottleneck, particularly when processing high-resolution images. Linear attention presents a promising alternative by reformulating the attention computation from $(QK)V$ to $Q(KV)$, thereby reducing the complexity from $\mathcal{O}(N^2)$ to $\mathcal{O}(N)$ while preserving the global receptive field. However, most existing methods compress historical key-value (KV) information uniformly, which can lead to feature redundancy and the loss of directional alignment with the query ($Q$). This uniform compression results in low-rank $KV$ feature maps, contributing to a performance gap relative to softmax attention. To mitigate this limitation, we propose \textbf{S}elective \textbf{A}daptive \textbf{GA}ting for Efficient and Expressive Linear Attention (SAGA), which introduces input-adaptive learnable gates to selectively modulate information aggregation into the $KV$ feature map. These gates enhance semantic diversity and alleviate the low-rank constraint inherent in conventional linear attention. Additionally, we propose an efficient Hadamard-product decomposition method for gate computation, which introduces no additional memory overhead. Experiments demonstrate that SAGA achieves a 1.76$\times$ improvement in throughput and a 2.69$\times$ reduction in peak GPU memory compared to PVT-T at a resolution of $1280 \times 1280$. Moreover, it improves top-1 accuracy by up to 4.4\% on the ImageNet dataset, demonstrating both computational efficiency and model effectiveness.
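To make the $(QK)V \to Q(KV)$ reformulation concrete, the sketch below contrasts standard softmax attention, which materializes an $N \times N$ score matrix, with a linear-attention variant that first forms a $d \times d$ key-value summary. The input-adaptive gate shown here (a sigmoid of a learned projection of $K$, applied elementwise before aggregation) is an illustrative assumption of how selective modulation of the $KV$ map might look, not the exact SAGA formulation; the names `feature_map` and `W_g` are hypothetical, and the paper's Hadamard-product decomposition of the gate is not reproduced.

```python
# Minimal NumPy sketch: softmax attention vs. gated linear attention.
# The gate (sigmoid of K @ W_g) is an assumed, illustrative stand-in for
# SAGA's input-adaptive gating, not the paper's actual method.
import numpy as np

def softmax_attention(Q, K, V):
    # Standard attention: materializes the N x N score matrix -> O(N^2).
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def feature_map(x):
    # Positive kernel feature map elu(x) + 1, common in linear attention.
    return np.where(x > 0, x + 1.0, np.exp(x))

def gated_linear_attention(Q, K, V, W_g):
    # Linear attention: compute the d x d summary K^T V first -> O(N d^2).
    Qf, Kf = feature_map(Q), feature_map(K)
    gate = 1.0 / (1.0 + np.exp(-(K @ W_g)))   # input-adaptive gate in (0, 1)
    Kg = Kf * gate                            # selectively modulate K rows
    KV = Kg.T @ V                             # d x d summary of gated KV
    Z = Qf @ Kg.sum(axis=0, keepdims=True).T  # per-query normalizer
    return (Qf @ KV) / (Z + 1e-6)

N, d = 4096, 64
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((N, d)) for _ in range(3))
W_g = rng.standard_normal((d, d)) * 0.02
out = gated_linear_attention(Q, K, V, W_g)
print(out.shape)  # (4096, 64), with no N x N matrix ever formed
```

Because the gate varies with the input rather than compressing all keys uniformly, different $KV$ contributions can be amplified or suppressed, which is the intuition behind alleviating the low-rank constraint the abstract describes.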
Similar Papers
Sparse Query Attention (SQA): A Computationally Efficient Attention Mechanism with Query Heads Reduction
Machine Learning (CS)
Makes AI learn faster with fewer calculations.
Value-State Gated Attention for Mitigating Extreme-Token Phenomena in Transformers
Machine Learning (CS)
Fixes AI mistakes by controlling its focus.
KQ-SVD: Compressing the KV Cache with Provable Guarantees on Attention Fidelity
Machine Learning (CS)
Makes AI models remember more without using more memory.