Designing Spatial Architectures for Sparse Attention: STAR Accelerator via Cross-Stage Tiling
By: Huizheng Wang, Taiquan Wei, Hongbin Wang, and more
Potential Business Impact:
Makes AI understand long text faster.
Large language models (LLMs) rely on self-attention for contextual understanding, demanding high-throughput inference and large-scale token parallel processing (LTPP). Existing dynamic-sparsity accelerators falter under LTPP because their optimizations are isolated within individual pipeline stages. Revisiting the end-to-end sparsity acceleration flow, we identify an overlooked opportunity: cross-stage coordination can substantially reduce redundant computation and memory access. We propose STAR, a cross-stage compute- and memory-efficient algorithm-hardware co-design tailored for Transformer inference under LTPP. STAR introduces leading-zero-based sparsity prediction that uses log-domain, add-only operations to minimize prediction overhead. It further employs distributed sorting and a sorted-updating FlashAttention mechanism, guided by a coordinated tiling strategy that enables fine-grained stage interaction for improved memory efficiency and lower latency. These optimizations are supported by a dedicated STAR accelerator architecture, which achieves up to 9.2$\times$ speedup and 71.2$\times$ higher energy efficiency than an NVIDIA A100 GPU, and surpasses SOTA accelerators with up to 16.1$\times$ higher energy efficiency and 27.1$\times$ higher area efficiency. Further, we deploy STAR onto a multi-core spatial architecture, optimizing dataflow and execution orchestration for ultra-long-sequence processing. Architectural evaluation shows that, compared to the baseline design, Spatial-STAR achieves a 20.1$\times$ throughput improvement.
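To make the prediction step concrete, here is a minimal NumPy sketch of the general idea behind leading-zero-based, add-only sparsity prediction: approximate each query-key product's magnitude in the log domain (where multiplication becomes addition) and keep only the top-ranked keys for exact attention. The helper names, the fixed-point scale, and the max-based dominant-term proxy are illustrative assumptions, not the paper's exact algorithm.

```python
import numpy as np

def approx_log2_magnitude(x, frac_bits=8):
    """Model of a leading-zero-count (LZC) unit: for a fixed-point value,
    the LZC position gives floor(log2(|x|)) without any multiplier.
    Hypothetical helper; the paper's quantization details are not shown."""
    mag = np.abs(np.round(x * (1 << frac_bits))).astype(np.int64)
    # Zeros map to a large negative exponent so they never rank highly.
    return np.where(mag > 0, np.floor(np.log2(np.maximum(mag, 1))), -64)

def predict_important_keys(Q, K, k_keep):
    """Add-only importance prediction: score each (query, key) pair by
    summing log-domain magnitudes (a product becomes an addition), then
    keep the k_keep highest-scoring keys per query for full attention."""
    lq = approx_log2_magnitude(Q)            # (n_q, d)
    lk = approx_log2_magnitude(K)            # (n_k, d)
    pair = lq[:, None, :] + lk[None, :, :]   # log|q_i * k_j| per dimension
    # Dominant-term proxy for the dot product: the largest log-magnitude
    # term tends to decide whether a score survives softmax (assumption).
    importance = pair.max(axis=-1)           # (n_q, n_k)
    return np.argsort(-importance, axis=1)[:, :k_keep]

# Toy usage: keep 16 of 128 keys per query before exact attention.
rng = np.random.default_rng(0)
Q, K = rng.standard_normal((4, 64)), rng.standard_normal((128, 64))
kept = predict_important_keys(Q, K, k_keep=16)
print(kept.shape)  # (4, 16)
```

Because only comparisons and additions appear on the prediction path, hardware can implement it with LZC units and adders instead of multipliers, which is the kind of prediction-overhead reduction the abstract describes.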
Similar Papers
BitStopper: An Efficient Transformer Attention Accelerator via Stage-fusion and Early Termination
Machine Learning (CS)
Makes AI faster and uses less power.
Sparse Attention Remapping with Clustering for Efficient LLM Decoding on PIM
Computation and Language
Makes AI understand long stories faster.
STAR: Stage-Wise Attention-Guided Token Reduction for Efficient Large Vision-Language Models Inference
Machine Learning (CS)
Speeds up AI that understands pictures and words.