Optimizing Native Sparse Attention with Latent Attention and Local Global Alternating Strategies
By: Yuxuan Hu, Jianchao Tan, Jiaqi Zhang, and more
Potential Business Impact:
Makes computers understand long stories better.
In this work, we conduct a systematic analysis of Native Sparse Attention (NSA) and propose targeted improvements that enhance long-context modeling. A key insight is that alternating between local (sliding-window) and global (compression, selection) attention across layers, rather than using a fixed pattern, enables more effective propagation of long-range dependencies and substantially boosts performance on long-sequence tasks. We further refine NSA's branches with latent attention: the sliding-window branch is enhanced with Multi-head Latent Attention (MLA), while the compression and selection branches adopt Group-head Latent Attention (GLA). These changes reduce KV-cache memory by 50% versus NSA while improving the model's common-sense reasoning and long-text understanding capabilities. Experiments on models from 340M to 1.3B parameters (trained on 15B and 100B tokens) show our method matches or exceeds full attention and Native Sparse Attention in both common-sense reasoning and long-context understanding tasks.
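The core architectural idea described above is a per-layer alternation between a local (sliding-window, MLA) branch and global (compression + selection, GLA) branches. Below is a minimal illustrative sketch of such a layer schedule in Python; the function name `attention_plan`, the alternation period, and the label strings are assumptions for illustration, not details taken from the paper.

```python
# Illustrative sketch only (not the authors' code): layers alternate between
# a local branch (sliding-window attention with MLA) and global branches
# (compression + selective attention with GLA) instead of running a fixed
# pattern in every layer. The period and labels are assumed for illustration.

def attention_plan(num_layers: int, period: int = 2) -> list[str]:
    """Return, per layer, which attention branch family is active."""
    plan = []
    for layer_idx in range(num_layers):
        if layer_idx % period == 0:
            plan.append("local")   # sliding-window attention with MLA
        else:
            plan.append("global")  # compression + selective attention with GLA
    return plan


if __name__ == "__main__":
    # Example: a 12-layer model alternating local/global every layer.
    print(attention_plan(12))
```

The point of the sketch is only to make the alternation concrete: long-range information gathered by the global layers can propagate through interleaved local layers, rather than every layer using the same fixed mix of branches.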
Similar Papers
SSA: Sparse Sparse Attention by Aligning Full and Sparse Attention Outputs in Feature Space
Computation and Language
Makes AI understand long stories better, faster.
Flash Sparse Attention: An Alternative Efficient Implementation of Native Sparse Attention Kernel
Distributed, Parallel, and Cluster Computing
Makes AI understand more words faster.
VideoNSA: Native Sparse Attention Scales Video Understanding
CV and Pattern Recognition
Lets computers watch and understand long videos better.