
Less Is More: Training-Free Sparse Attention with Global Locality for Efficient Reasoning

Published: August 9, 2025 | arXiv ID: 2508.07101v1

By: Lijie Yang, Zhihao Zhang, Arti Jain and more

BigTech Affiliations: Microsoft, Princeton University

Potential Business Impact:

Makes smart computers think faster with less effort.

Large reasoning models achieve strong performance through test-time scaling but incur substantial computational overhead, particularly from excessive token generation when processing short input prompts. While sparse attention mechanisms can reduce latency and memory usage, existing approaches suffer from significant accuracy degradation due to accumulated errors during long-generation reasoning. These methods generally require either high token retention rates or expensive retraining. We introduce LessIsMore, a training-free sparse attention mechanism for reasoning tasks, which leverages global attention patterns rather than relying on traditional head-specific local optimizations. LessIsMore aggregates token selections from local attention heads with recent contextual information, enabling unified cross-head token ranking for future decoding layers. This unified selection improves generalization and efficiency by avoiding the need to maintain separate token subsets per head. Evaluation across diverse reasoning tasks and benchmarks shows that LessIsMore preserves -- and in some cases improves -- accuracy while achieving a $1.1\times$ average decoding speed-up compared to full attention. Moreover, LessIsMore attends to $2\times$ fewer tokens without accuracy loss, achieving a $1.13\times$ end-to-end speed-up compared to existing sparse attention methods.
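The unified selection step described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function name `unified_token_selection`, the parameters `budget` and `recent_window`, and the choice to aggregate by summing per-head scores are all assumptions made for illustration. The abstract states only that per-head selections are aggregated with recent context into a single cross-head ranking.

```python
from collections import defaultdict


def unified_token_selection(head_scores, budget, recent_window):
    """Pick one shared set of cached-token indices for all heads.

    head_scores: one list of attention scores per head, all the same length.
    budget: total number of tokens to keep (hypothetical parameter).
    recent_window: number of most recent tokens that are always kept
        (hypothetical parameter).
    """
    num_tokens = len(head_scores[0])
    # The recent window can exceed neither the budget nor the cache size.
    recent_window = min(recent_window, budget, num_tokens)
    recent = list(range(num_tokens - recent_window, num_tokens))
    remaining = budget - recent_window

    # Older tokens compete for the remaining slots.
    candidates = list(range(num_tokens - recent_window))

    # Each head nominates its own top tokens; nominations are pooled.
    # Assumption: pooled score = sum of the nominating heads' scores.
    votes = defaultdict(float)
    for scores in head_scores:
        top = sorted(candidates, key=lambda i: scores[i], reverse=True)
        for i in top[:remaining]:
            votes[i] += scores[i]

    # Rank the pooled nominations once, across heads (ties go to the
    # earlier token), and keep the best `remaining` of them.
    ranked = sorted(votes, key=lambda i: (-votes[i], i))
    return sorted(ranked[:remaining] + recent)


# Toy example: two heads, six cached tokens.
head_scores = [
    [0.50, 0.05, 0.05, 0.20, 0.10, 0.10],
    [0.05, 0.40, 0.05, 0.35, 0.05, 0.10],
]
print(unified_token_selection(head_scores, budget=4, recent_window=2))
```

In this toy case both heads nominate token 3, so it outranks tokens that only one head favours, and every head then attends to the same four tokens. Sharing one index set is what removes the need to maintain a separate token subset per head.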

Lil: Less is Less When Applying Post-Training Sparse-Attention Algorithms in Long-Decode Stage

Computation and Language

Saves computer time by stopping early.

6 Jan 2026 1

89%

Making Every Head Count: Sparse Attention Without the Speed-Performance Trade-off

Machine Learning (CS)

Makes AI understand long texts much faster.

12 Nov 2025 2

89%

A Unified Sparse Attention via Multi-Granularity Compression

Computation and Language

Makes AI understand long texts much faster.

16 Dec 2025 0


Country of Origin

🇺🇸 United States

Repos / Data Links

github.com

Page Count

18 pages
