
Understanding and Improving Length Generalization in Hierarchical Sparse Attention Models

Published: October 20, 2025 | arXiv ID: 2510.17196v1

By: Jiaqi Leng, Xiang Hu, Junxiong Wang, and more

Potential Business Impact:

Lets computers understand much longer stories.

Business Areas:

Natural Language Processing Artificial Intelligence, Data and Analytics, Software

Effectively processing long contexts is a critical challenge for language models. While standard Transformers are limited by quadratic complexity and poor length extrapolation, alternative architectures like sliding window attention and state space models sacrifice the ability to effectively utilize the full context due to their fixed-size memory. Chunk-based sparse attention has emerged as a promising paradigm for extreme length generalization, yet the key architectural principles underpinning its success are not yet fully understood. In this work, we present a systematic dissection of these models to identify the core components driving their performance. Through a unified framework and comprehensive ablation studies, we demonstrate that a combination of three design principles is critical: (1) an expressive, non-linear Chunk Encoder with a dedicated CLS token to produce representations for retrieval; (2) a Bypassing Residual Path to stably integrate retrieved global information without it being overridden by the local residual stream; and (3) enforced selection sparsity during pre-training to bridge the train-test distribution gap. We provide a theoretical motivation for intra-chunk information processing and landmark generation. By combining these principles, we establish a new state-of-the-art for training-free length extrapolation, successfully generalizing models trained on a 4K context to 32 million tokens on RULER and BABILong. Our findings provide a clear and empirically grounded set of design principles for developing future, highly capable long-context language models.
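The chunk-based retrieval idea in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names (`select_chunks`, `chunk_sparse_attention`) are hypothetical, and mean-pooled keys stand in for the learned, non-linear Chunk Encoder with a CLS token that the abstract identifies as critical. The sketch shows only the general pattern of scoring one landmark per chunk, keeping the top-k chunks, and attending over the tokens inside them.

```python
import numpy as np

def select_chunks(query, landmarks, top_k):
    """Return sorted indices of the top_k chunks whose landmarks best match the query."""
    scores = landmarks @ query                    # one score per chunk
    top = np.argsort(scores)[::-1][:top_k]        # highest-scoring chunks first
    return np.sort(top)

def chunk_sparse_attention(query, keys, values, chunk_size, top_k):
    """Attend only over tokens in the top_k retrieved chunks (single query, single head)."""
    d = keys.shape[1]
    num_chunks = keys.shape[0] // chunk_size
    usable = num_chunks * chunk_size              # drop any ragged tail for simplicity
    k_chunks = keys[:usable].reshape(num_chunks, chunk_size, d)
    v_chunks = values[:usable].reshape(num_chunks, chunk_size, -1)

    # ASSUMPTION: mean pooling is a placeholder for a learned chunk encoder.
    landmarks = k_chunks.mean(axis=1)

    chosen = select_chunks(query, landmarks, top_k)
    k_sel = k_chunks[chosen].reshape(-1, d)
    v_sel = v_chunks[chosen].reshape(-1, v_chunks.shape[-1])

    # Standard scaled dot-product attention restricted to the selected tokens.
    logits = k_sel @ query / np.sqrt(d)
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    return weights @ v_sel, chosen
```

Because the attention cost depends on `top_k * chunk_size` rather than on the full sequence length, the per-query work stays fixed as the context grows; the Bypassing Residual Path and the sparsity enforced during pre-training described in the abstract are not modeled here.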

Every Token Counts: Generalizing 16M Ultra-Long Context in Large Language Models

Computation and Language

Lets computers remember much longer stories.

28 Nov 2025 1

90%

Long-Context Modeling with Dynamic Hierarchical Sparse Attention for On-Device LLMs

Computation and Language

Makes AI understand long texts faster and cheaper.

28 Oct 2025 2

89%

ScaleFormer: Span Representation Cumulation for Long-Context Transformer

Computation and Language

Lets computers understand long stories better.

13 Nov 2025 1


Country of Origin

🇨🇳 🇺🇸 United States, China

Page Count

19 pages
