Punctuation-aware Hybrid Trainable Sparse Attention for Large Language Models
By: Junxiang Qiu, Shuo Wang, Zhengsu Chen, and more
Potential Business Impact:
Helps computers understand long stories better.
Attention is the fundamental mechanism for long-context modeling in large language models (LLMs), yet dense attention becomes prohibitively expensive for long sequences because of its quadratic complexity. Consequently, sparse attention has gained traction as a scalable alternative. However, existing sparse attention methods rely on coarse-grained semantic representations during block selection, which blur intra-block semantic boundaries and lose critical information. To address this issue, we propose Punctuation-aware Hybrid Sparse Attention (PHSA), a natively trainable sparse attention framework that leverages punctuation tokens as semantic boundary anchors. Specifically, (1) we design a dual-branch aggregation mechanism that fuses global semantic representations with punctuation-enhanced boundary features, preserving the core semantic structure while introducing almost no additional computational overhead; and (2) we introduce an extreme-sparsity-adaptive training and inference strategy that stabilizes model behavior under very low token activation ratios. Extensive experiments on general benchmarks and long-context evaluations demonstrate that PHSA consistently outperforms dense attention and state-of-the-art sparse attention baselines, including InfLLM v2. In particular, for a 0.6B-parameter model with 32k-token input sequences, PHSA reduces information loss by 10.8% at a sparsity ratio of 97.3%.
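To make the block-selection idea in the abstract concrete, the sketch below shows one way a dual-branch aggregation could work: each block of key vectors gets a global pooled representation and a punctuation-anchored representation, the two are fused into one block summary, and the query scores those summaries to keep only the top-k blocks for attention. This is a minimal illustrative sketch, not the paper's implementation; the function names, the punctuation id set, the mean pooling in both branches, and the simple averaged fusion are all assumptions.

```python
# Minimal sketch of dual-branch block aggregation for sparse block selection.
# All names and design choices here are illustrative assumptions.
import torch

PUNCT_IDS = {13, 30, 0}  # hypothetical token ids for punctuation such as '.', '?', ','

def aggregate_block_keys(keys, token_ids, block_size=64):
    """Compress per-token keys into one representation per block.

    Branch 1: mean over all tokens in the block (global semantics).
    Branch 2: mean over punctuation tokens only (semantic boundary anchors);
              falls back to the global branch when a block has no punctuation.
    """
    T, d = keys.shape
    n_blocks = (T + block_size - 1) // block_size
    reps = torch.zeros(n_blocks, d, dtype=keys.dtype)
    for b in range(n_blocks):
        blk = keys[b * block_size:(b + 1) * block_size]
        ids = token_ids[b * block_size:(b + 1) * block_size]
        global_rep = blk.mean(dim=0)                          # branch 1: global semantics
        punct_mask = torch.tensor([int(i) in PUNCT_IDS for i in ids])
        if punct_mask.any():
            boundary_rep = blk[punct_mask].mean(dim=0)        # branch 2: boundary anchors
        else:
            boundary_rep = global_rep
        reps[b] = 0.5 * (global_rep + boundary_rep)           # simple fusion (assumed)
    return reps

def select_blocks(query, block_reps, top_k=8):
    """Score blocks against the current query and keep only the top-k for attention."""
    scores = block_reps @ query                               # (n_blocks,)
    return torch.topk(scores, k=min(top_k, block_reps.shape[0])).indices

# Illustration of the scale involved: with 32k tokens and block_size=64 there are
# 512 blocks; keeping the top 8 blocks means attending to 8*64 = 512 tokens,
# roughly 1.6% of the context (about 98% sparsity). These numbers are only an
# example, not the configuration reported in the paper.
```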
Similar Papers
SSA: Sparse Sparse Attention by Aligning Full and Sparse Attention Outputs in Feature Space
Computation and Language
Makes AI understand long stories better, faster.
Long-Context Modeling with Dynamic Hierarchical Sparse Attention for On-Device LLMs
Computation and Language
Makes AI understand long texts faster and cheaper.
Making Every Head Count: Sparse Attention Without the Speed-Performance Trade-off
Machine Learning (CS)
Makes AI understand long texts much faster.