SSA: Sparse Sparse Attention by Aligning Full and Sparse Attention Outputs in Feature Space
By: Zhenyi Shen, Junru Lu, Lin Gui, and more
Potential Business Impact:
Makes AI understand long stories better, faster.
The quadratic complexity of full attention limits efficient long-context processing in large language models (LLMs). Sparse attention mitigates this cost by restricting each query to attend to a subset of previous tokens; however, training-free approaches often lead to severe performance degradation. Native sparse-attention methods (e.g., NSA, MoBA) alleviate this issue, yet exhibit a critical paradox: they produce lower attention sparsity than full-attention models, despite aiming to approximate full attention, which may constrain their effectiveness. We attribute this paradox to gradient update deficiency: low-ranked key-value pairs excluded during sparse training receive neither forward contribution nor backward gradients, and thus never learn proper suppression. To overcome this limitation, we propose SSA (Sparse Sparse Attention), a unified training framework that considers both sparse and full attention and enforces bidirectional alignment at every layer. This design preserves gradient flow to all tokens while explicitly encouraging sparse-attention outputs to align with their full-attention counterparts, thereby promoting stronger sparsity. As a result, SSA achieves state-of-the-art performance under both sparse and full attention inference across multiple commonsense benchmarks. Furthermore, SSA enables models to adapt smoothly to varying sparsity budgets; performance improves consistently as queries are allowed to attend to more tokens, supporting flexible compute-performance trade-offs at inference time. Finally, we show that native sparse-attention training surprisingly improves long-context extrapolation by mitigating the over-allocation of attention values in sink areas, with SSA demonstrating the strongest extrapolation capability.
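To make the alignment idea concrete, below is a minimal Python sketch of an SSA-style training step, not the authors' code: the model is run once with full attention and once with sparse attention, both branches receive a language-modeling loss (so every token keeps receiving gradients), and per-layer attention outputs are pulled together with a symmetric penalty. The model interface (a mode argument returning logits plus per-layer outputs), the MSE form of the alignment term, and the weighting alpha are illustrative assumptions; the paper's exact objective may differ.

import torch
import torch.nn.functional as F

def ssa_training_step(model, batch, alpha=1.0):
    # Assumed interface: model(input_ids, mode=...) returns (logits, list of
    # per-layer attention outputs). "full" uses dense attention, "sparse"
    # restricts each query to a subset of previous tokens.
    full_logits, full_layer_outs = model(batch["input_ids"], mode="full")
    sparse_logits, sparse_layer_outs = model(batch["input_ids"], mode="sparse")

    # Language-modeling loss on both branches; labels are assumed pre-shifted.
    lm_full = F.cross_entropy(full_logits.flatten(0, 1), batch["labels"].flatten())
    lm_sparse = F.cross_entropy(sparse_logits.flatten(0, 1), batch["labels"].flatten())

    # Bidirectional alignment at every layer: a symmetric MSE pulls the sparse
    # outputs toward the full-attention outputs and vice versa.
    align = sum(F.mse_loss(s, f) for s, f in zip(sparse_layer_outs, full_layer_outs))
    align = align / len(full_layer_outs)

    return lm_full + lm_sparse + alpha * align

In this sketch the alignment term is what encourages the sparse branch to reproduce the full-attention outputs, while the full-attention branch keeps gradients flowing to key-value pairs that the sparse selection would otherwise exclude.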
Similar Papers
Optimizing Native Sparse Attention with Latent Attention and Local Global Alternating Strategies
Computation and Language
Makes computers understand long stories better.
Flash Sparse Attention: An Alternative Efficient Implementation of Native Sparse Attention Kernel
Distributed, Parallel, and Cluster Computing
Makes AI understand more words faster.
Making Every Head Count: Sparse Attention Without the Speed-Performance Trade-off
Machine Learning (CS)
Makes AI understand long texts much faster.