Flash Sparse Attention: An Alternative Efficient Implementation of Native Sparse Attention Kernel
By: Ran Yan, Youhe Jiang, Binhang Yuan
Potential Business Impact:
Speeds up long-context training and inference for large language models.
Recent progress in sparse attention mechanisms has demonstrated strong potential for reducing the computational cost of long-context training and inference in large language models (LLMs). Native Sparse Attention (NSA), a state-of-the-art approach, introduces natively trainable, hardware-aligned sparse attention that delivers substantial system-level performance gains while maintaining accuracy comparable to full attention. However, the NSA kernel implementation relies on a query-grouping strategy that is efficient only with large Grouped Query Attention (GQA) group sizes, whereas modern LLMs typically adopt much smaller GQA groups, which limits the applicability of this sparse algorithmic advance. In this work, we propose Flash Sparse Attention (FSA), which provides an alternative kernel design that enables efficient NSA computation across a wide range of popular LLMs with varied, smaller GQA group sizes on modern GPUs. Compared to the vanilla NSA kernel implementation, our empirical evaluation demonstrates that FSA achieves (i) up to 3.5$\times$ and on average 1.6$\times$ kernel-level latency reduction, (ii) up to 1.25$\times$ and on average 1.09$\times$ end-to-end training speedup on state-of-the-art LLMs, and (iii) up to 1.36$\times$ and on average 1.11$\times$ end-to-end prefill speedup on state-of-the-art LLMs. The source code is open-sourced and publicly available at https://github.com/Relaxed-System-Lab/Flash-Sparse-Attention.
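To make the GQA constraint concrete, the short Python sketch below illustrates how the GQA group size is derived from the query and key/value head counts, and why a query-grouped kernel that batches the query heads of one group into a single matrix-multiply tile is underutilized when that group is small. The head counts, helper names, and 64-row tile size are illustrative assumptions, not taken from the NSA or FSA kernels.

```python
# Illustrative sketch only: how GQA group size relates to the fill rate of a
# query-grouped attention kernel's matrix-multiply tile. Tile size and head
# counts are hypothetical examples.

def gqa_group_size(num_q_heads: int, num_kv_heads: int) -> int:
    """Number of query heads that share one KV head under GQA."""
    assert num_q_heads % num_kv_heads == 0
    return num_q_heads // num_kv_heads

def tile_utilization(group_size: int, tile_rows: int = 64) -> float:
    """Fraction of a tile_rows-row tile occupied when one kernel block loads
    only the query heads of a single GQA group for one query position."""
    return min(group_size, tile_rows) / tile_rows

if __name__ == "__main__":
    for q_heads, kv_heads in [(64, 4), (32, 8), (28, 4)]:
        g = gqa_group_size(q_heads, kv_heads)
        print(f"{q_heads} query heads / {kv_heads} KV heads -> "
              f"group size {g}, tile utilization {tile_utilization(g):.2f}")
```

With small GQA groups, most of the tile is padding rather than useful work; that utilization gap is what the alternative FSA kernel design targets.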
Similar Papers
SSA: Sparse Sparse Attention by Aligning Full and Sparse Attention Outputs in Feature Space
Computation and Language
Makes AI understand long stories better, faster.
Optimizing Native Sparse Attention with Latent Attention and Local Global Alternating Strategies
Computation and Language
Makes computers understand long stories better.
Generalized Neighborhood Attention: Multi-dimensional Sparse Attention at the Speed of Light
CV and Pattern Recognition
Makes AI models run much faster without changes.