Critical attention scaling in long-context transformers
By: Shi Chen, Zhengjiang Lin, Yury Polyanskiy, and more
Potential Business Impact:
Helps AI models stay accurate when reading much longer texts.
As large language models scale to longer contexts, attention layers suffer from a fundamental pathology: attention scores collapse toward uniformity as the context length $n$ increases, causing tokens to cluster excessively, a phenomenon known as rank collapse. While $\textit{attention scaling}$ effectively addresses this deficiency by rescaling attention scores with a polylogarithmic factor $\beta_n$, theoretical justification for this approach remains lacking. We analyze a simplified yet tractable model that magnifies the effect of attention scaling. In this model, attention exhibits a phase transition governed by the scaling factor $\beta_n$: insufficient scaling collapses all tokens onto a single direction, while excessive scaling reduces attention to the identity map, eliminating meaningful interactions between tokens. Our main result identifies the critical scaling $\beta_n \asymp \log n$ and provides a rigorous justification for the attention scaling used in YaRN and Qwen, clarifying why logarithmic scaling maintains sparse, content-adaptive attention at large context lengths.
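To make the mechanism concrete, here is a minimal NumPy sketch of attention scaling: the attention logits are multiplied by a length-dependent factor $\beta_n$ before the softmax, with the critical choice $\beta_n \asymp \log n$ from the abstract. The function name `scaled_softmax_attention` and the constants are illustrative assumptions, not the exact formulas used in YaRN or Qwen.

```python
import numpy as np

def scaled_softmax_attention(Q, K, V, beta_n=None):
    """Single-head dot-product attention with an extra length-dependent
    factor beta_n applied to the logits (a sketch of 'attention scaling').

    Q, K, V: arrays of shape (n, d). If beta_n is None, use the critical
    scaling beta_n = log(n); the exact constants in YaRN/Qwen differ.
    """
    n, d = Q.shape
    if beta_n is None:
        beta_n = np.log(n)                        # critical scaling beta_n ~ log n
    logits = beta_n * (Q @ K.T) / np.sqrt(d)      # rescaled attention scores
    logits -= logits.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

# Without the log(n) factor the softmax weights flatten toward 1/n as n grows;
# with beta_n ~ log n they remain sparse and content-adaptive.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d = 4096, 64
    Q = rng.standard_normal((n, d)) / np.sqrt(d)
    K = rng.standard_normal((n, d)) / np.sqrt(d)
    V = rng.standard_normal((n, d))
    out = scaled_softmax_attention(Q, K, V)
    print(out.shape)  # (4096, 64)
```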
Similar Papers
Token Sample Complexity of Attention
Machine Learning (CS)
Makes AI understand longer stories better.
PowerAttention: Exponentially Scaling of Receptive Fields for Effective Sparse Attention
Computation and Language
Lets computers understand much longer stories.
Understanding and Improving Length Generalization in Hierarchical Sparse Attention Models
Computation and Language
Lets computers understand much longer stories.