Multipole Semantic Attention: A Fast Approximation of Softmax Attention for Pretraining
By: Rupert Mitchell, Kristian Kersting
Potential Business Impact:
Makes computers understand long texts faster.
We present Multipole Semantic Attention (MuSe), an efficient approximation of softmax attention that combines semantic clustering with multipole expansions from computational physics. Our method addresses the quadratic computational complexity of transformers in the context length by clustering queries and keys separately in their learned representation spaces, enabling a hierarchical two-stage attention mechanism. Unlike prior clustering approaches that group only keys or use unified clustering, we maintain separate clusterings that respect attention's asymmetric treatment of these spaces. We augment centroid-based (monopole) approximations with dipole corrections that capture directional variance within clusters, preserving richer information during training. The method operates as a drop-in replacement for standard attention, requiring only hyperparameter specification without architectural modifications. Our approach achieves $\mathcal{O}(NCD)$ complexity for acausal attention with $C$ clusters and $\mathcal{O}(NCD \log N)$ for causal attention. On isolated attention layers, we demonstrate $3\times$ speedup over cuDNN Flash Attention at 8k context length, with relative squared errors below 20%. For causal attention, we develop a hierarchical block decomposition that combines exact local computation with efficient long-range approximation. In end-to-end pretraining of a 30M parameter model on book-length texts with 16k context, we achieve 12.2% runtime reduction with only 0.36% loss degradation, establishing the viability of multipole approximations for efficient transformer pretraining.
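To make the mechanism concrete, below is a minimal NumPy sketch of the acausal monopole-plus-dipole approximation described in the abstract: keys are grouped into clusters, each cluster is summarized by its centroid, the sum of its values (monopole moment), and the sum of value-weighted key offsets (dipole moment), and queries attend only to these per-cluster summaries. The function name `clustered_attention`, the precomputed `labels` argument, and all variable names are illustrative assumptions, not the authors' implementation, which also clusters queries and adds a hierarchical block decomposition for the causal case.

```python
# Minimal sketch of acausal attention approximated by per-cluster
# monopole and dipole moments. Assumes cluster labels are given
# (e.g. from k-means on the keys) and every cluster is non-empty.
import numpy as np

def clustered_attention(Q, K, V, labels, C):
    """Approximate softmax(Q K^T / sqrt(D)) V using cluster summaries.

    Q: (N, D) queries, K: (N, D) keys, V: (N, Dv) values,
    labels: (N,) cluster index of each key, C: number of clusters.
    """
    N, D = K.shape
    Dv = V.shape[1]
    mu = np.zeros((C, D))        # cluster centroids (expansion centres)
    counts = np.zeros(C)         # number of keys per cluster
    mono = np.zeros((C, Dv))     # monopole moment: sum of values in the cluster
    dip = np.zeros((C, D, Dv))   # dipole moment: sum of (k - mu) outer v

    for c in range(C):
        idx = labels == c
        counts[c] = idx.sum()
        mu[c] = K[idx].mean(axis=0)
        delta = K[idx] - mu[c]              # within-cluster key offsets
        mono[c] = V[idx].sum(axis=0)
        dip[c] = delta.T @ V[idx]

    scale = 1.0 / np.sqrt(D)
    logits = (Q @ mu.T) * scale             # (N, C): query-to-centroid scores
    w = np.exp(logits - logits.max(axis=1, keepdims=True))  # shift cancels in the ratio

    # First-order expansion: exp(q.k) ~= exp(q.mu) * (1 + q.(k - mu)).
    # The first-order term in the denominator vanishes because mu is the
    # within-cluster mean of the keys.
    num = w @ mono + np.einsum('nc,nd,cdv->nv', w, Q * scale, dip)
    den = w @ counts
    return num / den[:, None]
```

This sketch covers only the long-range, acausal part of the method; in the paper's causal setting, nearby tokens are handled with exact local attention blocks and only distant tokens are routed through cluster summaries of this kind.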
Similar Papers
Multipole Attention for Efficient Long Context Reasoning
Computation and Language
Makes smart computers think faster and better.
SEMA: a Scalable and Efficient Mamba like Attention via Token Localization and Averaging
Computer Vision and Pattern Recognition
Makes computer vision models see better, faster.