SEMA: a Scalable and Efficient Mamba like Attention via Token Localization and Averaging
By: Nhat Thanh Tran, Fanghui Xue, Shuai Zhang, and more
Potential Business Impact:
Makes computer vision models see better, faster.
Attention is the critical component of a transformer. Yet the quadratic computational complexity of vanilla full attention in the input size, and the inability of its linear attention variant to focus, have been challenges for computer vision tasks. We provide a mathematical definition of generalized attention and formulate both vanilla softmax attention and linear attention within this general framework. We prove that generalized attention disperses; that is, as the number of keys tends to infinity, the query assigns equal weights to all keys. Motivated by the dispersion property and the recent development of the Mamba form of attention, we design Scalable and Efficient Mamba like Attention (SEMA), which uses token localization to avoid dispersion and maintain focusing, complemented by theoretically consistent arithmetic averaging to capture the global aspect of attention. We support our approach with experiments on ImageNet-1K, where classification results show that SEMA is a scalable and effective alternative beyond linear attention, outperforming recent vision Mamba models at increasingly larger image scales for similar model parameter sizes.
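The abstract's core argument rests on two ideas: softmax attention weights disperse toward the uniform value 1/n as the number of keys n grows, and restricting attention to a local window while adding a global average can restore focus without losing global context. The NumPy sketch below is only an illustration of those two ideas, not the paper's exact construction; the function names, the window parameter, and the way the global average is combined with the local output are assumptions made for this example.

# Illustrative sketch only: contrasts softmax-attention weight dispersion with a
# hypothetical "local attention + global average" combination in the spirit of SEMA.
import numpy as np

def softmax_attention(q, K, V):
    # q: (d,), K: (n, d), V: (n, d_v); returns the attended value and the weights.
    scores = K @ q / np.sqrt(q.shape[0])
    w = np.exp(scores - scores.max())
    w = w / w.sum()                      # attention weights over all n keys
    return w @ V, w

rng = np.random.default_rng(0)
d = 64
q = rng.standard_normal(d)

# Dispersion: as the number of random keys n grows, the largest attention weight
# shrinks toward the uniform value 1/n, so the query stops "focusing".
for n in (64, 1024, 16384):
    K = rng.standard_normal((n, d))
    V = rng.standard_normal((n, d))
    _, w = softmax_attention(q, K, V)
    print(n, w.max(), 1.0 / n)

def local_plus_average(Q, K, V, window=16):
    # Hypothetical sketch of the localization-plus-averaging idea: restrict
    # softmax attention to a local token window (keeping weights focused), then
    # add a global arithmetic average of the values for global context.
    n = Q.shape[0]
    out = np.empty_like(V)
    global_avg = V.mean(axis=0)
    for i in range(n):
        lo, hi = max(0, i - window), min(n, i + window + 1)
        local, _ = softmax_attention(Q[i], K[lo:hi], V[lo:hi])
        out[i] = local + global_avg
    return out

Running the dispersion loop prints a maximum weight that tracks 1/n as n increases, which is the behavior the paper's dispersion theorem formalizes; the local window in the second function keeps the effective key count small and fixed, so its weights do not disperse.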
Similar Papers
SCOUT: Toward Sub-Quadratic Attention via Segment Compression for Optimized Utility in Transformers
Machine Learning (CS)
Makes computers understand long stories faster.
Multiscale Aggregated Hierarchical Attention (MAHA): A Game Theoretic and Optimization Driven Approach to Efficient Contextual Modeling in Large Language Models
Computation and Language
Makes AI understand long stories faster and cheaper.
A Separable Self-attention Inspired by the State Space Model for Computer Vision
CV and Pattern Recognition
Makes computers see pictures faster and better.