Critical attention scaling in long-context transformers
By: Shi Chen, Zhengjiang Lin, Yury Polyanskiy, and more
Potential Business Impact:
Helps AI models stay accurate when reading much longer texts.
As large language models scale to longer contexts, attention layers suffer from a fundamental pathology: attention scores collapse toward uniformity as the context length $n$ increases, causing tokens to cluster excessively, a phenomenon known as rank collapse. While $\textit{attention scaling}$ effectively addresses this deficiency by rescaling attention scores with a polylogarithmic factor $\beta_n$, theoretical justification for this approach remains lacking. We analyze a simplified yet tractable model that magnifies the effect of attention scaling. In this model, attention exhibits a phase transition governed by the scaling factor $\beta_n$: insufficient scaling collapses all tokens onto a single direction, while excessive scaling reduces attention to the identity map, eliminating meaningful interactions between tokens. Our main result identifies the critical scaling $\beta_n \asymp \log n$ and provides a rigorous justification for the attention scaling used in YaRN and Qwen, clarifying why logarithmic scaling maintains sparse, content-adaptive attention at large context lengths.
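To make the mechanism concrete, here is a minimal NumPy sketch of attention scaling: the attention logits are multiplied by a length-dependent factor $\beta_n$ before the softmax, with the critical choice $\beta_n \asymp \log n$ from the abstract. The function name `scaled_softmax_attention` and the constants are illustrative assumptions, not the exact formulas used in YaRN or Qwen.

```python
import numpy as np

def scaled_softmax_attention(Q, K, V, beta_n=None):
    """Single-head dot-product attention with an extra length-dependent
    factor beta_n applied to the logits (a sketch of 'attention scaling').

    Q, K, V: arrays of shape (n, d). If beta_n is None, use the critical
    scaling beta_n = log(n); the exact constants in YaRN/Qwen differ.
    """
    n, d = Q.shape
    if beta_n is None:
        beta_n = np.log(n)                        # critical scaling beta_n ~ log n
    logits = beta_n * (Q @ K.T) / np.sqrt(d)      # rescaled attention scores
    logits -= logits.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

# Without the log(n) factor the softmax weights flatten toward 1/n as n grows;
# with beta_n ~ log n they remain sparse and content-adaptive.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d = 4096, 64
    Q = rng.standard_normal((n, d)) / np.sqrt(d)
    K = rng.standard_normal((n, d)) / np.sqrt(d)
    V = rng.standard_normal((n, d))
    out = scaled_softmax_attention(Q, K, V)
    print(out.shape)  # (4096, 64)
```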
Similar Papers
Token Sample Complexity of Attention
Machine Learning (CS)
Makes AI understand longer stories better.
PowerAttention: Exponentially Scaling of Receptive Fields for Effective Sparse Attention
Computation and Language
Lets computers understand much longer stories.
Understanding and Improving Length Generalization in Hierarchical Sparse Attention Models
Computation and Language
Lets computers understand much longer stories.