Punctuation-aware Hybrid Trainable Sparse Attention for Large Language Models
By: Junxiang Qiu, Shuo Wang, Zhengsu Chen, and more
Potential Business Impact:
Helps computers understand long stories better.
Attention is the fundamental mechanism for long-context modeling in large language models (LLMs), yet dense attention becomes prohibitively expensive for long sequences because of its quadratic complexity. Consequently, sparse attention has gained traction as a scalable alternative. However, existing sparse attention methods rely on coarse-grained semantic representations during block selection, which blur intra-block semantic boundaries and lose critical information. To address this issue, we propose Punctuation-aware Hybrid Sparse Attention (PHSA), a natively trainable sparse attention framework that leverages punctuation tokens as semantic boundary anchors. Specifically, (1) we design a dual-branch aggregation mechanism that fuses global semantic representations with punctuation-enhanced boundary features, preserving the core semantic structure while introducing almost no additional computational overhead; and (2) we introduce an extreme-sparsity-adaptive training and inference strategy that stabilizes model behavior under very low token activation ratios. Extensive experiments on general benchmarks and long-context evaluations demonstrate that PHSA consistently outperforms dense attention and state-of-the-art sparse attention baselines, including InfLLM v2. In particular, for a 0.6B-parameter model with 32k-token input sequences, PHSA reduces information loss by 10.8% at a sparsity ratio of 97.3%.
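To make the block-selection idea in the abstract concrete, the sketch below shows one way a dual-branch aggregation could work: each block of key vectors gets a global pooled representation and a punctuation-anchored representation, the two are fused into one block summary, and the query scores those summaries to keep only the top-k blocks for attention. This is a minimal illustrative sketch, not the paper's implementation; the function names, the punctuation id set, the mean pooling in both branches, and the simple averaged fusion are all assumptions.

```python
# Minimal sketch of dual-branch block aggregation for sparse block selection.
# All names and design choices here are illustrative assumptions.
import torch

PUNCT_IDS = {13, 30, 0}  # hypothetical token ids for punctuation such as '.', '?', ','

def aggregate_block_keys(keys, token_ids, block_size=64):
    """Compress per-token keys into one representation per block.

    Branch 1: mean over all tokens in the block (global semantics).
    Branch 2: mean over punctuation tokens only (semantic boundary anchors);
              falls back to the global branch when a block has no punctuation.
    """
    T, d = keys.shape
    n_blocks = (T + block_size - 1) // block_size
    reps = torch.zeros(n_blocks, d, dtype=keys.dtype)
    for b in range(n_blocks):
        blk = keys[b * block_size:(b + 1) * block_size]
        ids = token_ids[b * block_size:(b + 1) * block_size]
        global_rep = blk.mean(dim=0)                          # branch 1: global semantics
        punct_mask = torch.tensor([int(i) in PUNCT_IDS for i in ids])
        if punct_mask.any():
            boundary_rep = blk[punct_mask].mean(dim=0)        # branch 2: boundary anchors
        else:
            boundary_rep = global_rep
        reps[b] = 0.5 * (global_rep + boundary_rep)           # simple fusion (assumed)
    return reps

def select_blocks(query, block_reps, top_k=8):
    """Score blocks against the current query and keep only the top-k for attention."""
    scores = block_reps @ query                               # (n_blocks,)
    return torch.topk(scores, k=min(top_k, block_reps.shape[0])).indices

# Illustration of the scale involved: with 32k tokens and block_size=64 there are
# 512 blocks; keeping the top 8 blocks means attending to 8*64 = 512 tokens,
# roughly 1.6% of the context (about 98% sparsity). These numbers are only an
# example, not the configuration reported in the paper.
```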
Similar Papers
SSA: Sparse Sparse Attention by Aligning Full and Sparse Attention Outputs in Feature Space
Computation and Language
Makes AI understand long stories better, faster.
Long-Context Modeling with Dynamic Hierarchical Sparse Attention for On-Device LLMs
Computation and Language
Makes AI understand long texts faster and cheaper.
Making Every Head Count: Sparse Attention Without the Speed-Performance Trade-off
Machine Learning (CS)
Makes AI understand long texts much faster.