Designing Spatial Architectures for Sparse Attention: STAR Accelerator via Cross-Stage Tiling
By: Huizheng Wang, Taiquan Wei, Hongbin Wang, and more
Potential Business Impact:
Makes AI understand long text faster.
Large language models (LLMs) rely on self-attention for contextual understanding, demanding high-throughput inference and large-scale token parallel processing (LTPP). Existing dynamic-sparsity accelerators falter under LTPP because their optimizations are isolated within individual pipeline stages. Revisiting the end-to-end sparsity acceleration flow, we identify an overlooked opportunity: cross-stage coordination can substantially reduce redundant computation and memory access. We propose STAR, a cross-stage compute- and memory-efficient algorithm-hardware co-design tailored for Transformer inference under LTPP. STAR introduces leading-zero-based sparsity prediction that uses log-domain, add-only operations to minimize prediction overhead. It further employs distributed sorting and a sorted-updating FlashAttention mechanism, guided by a coordinated tiling strategy that enables fine-grained stage interaction for improved memory efficiency and lower latency. These optimizations are supported by a dedicated STAR accelerator architecture, which achieves up to 9.2$\times$ speedup and 71.2$\times$ higher energy efficiency than an NVIDIA A100 GPU, and surpasses SOTA accelerators with up to 16.1$\times$ higher energy efficiency and 27.1$\times$ higher area efficiency. Further, we deploy STAR onto a multi-core spatial architecture, optimizing dataflow and execution orchestration for ultra-long-sequence processing. Architectural evaluation shows that, compared to the baseline design, Spatial-STAR achieves a 20.1$\times$ throughput improvement.
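To make the prediction step concrete, here is a minimal NumPy sketch of the general idea behind leading-zero-based, add-only sparsity prediction: approximate each query-key product's magnitude in the log domain (where multiplication becomes addition) and keep only the top-ranked keys for exact attention. The helper names, the fixed-point scale, and the max-based dominant-term proxy are illustrative assumptions, not the paper's exact algorithm.

```python
import numpy as np

def approx_log2_magnitude(x, frac_bits=8):
    """Model of a leading-zero-count (LZC) unit: for a fixed-point value,
    the LZC position gives floor(log2(|x|)) without any multiplier.
    Hypothetical helper; the paper's quantization details are not shown."""
    mag = np.abs(np.round(x * (1 << frac_bits))).astype(np.int64)
    # Zeros map to a large negative exponent so they never rank highly.
    return np.where(mag > 0, np.floor(np.log2(np.maximum(mag, 1))), -64)

def predict_important_keys(Q, K, k_keep):
    """Add-only importance prediction: score each (query, key) pair by
    summing log-domain magnitudes (a product becomes an addition), then
    keep the k_keep highest-scoring keys per query for full attention."""
    lq = approx_log2_magnitude(Q)            # (n_q, d)
    lk = approx_log2_magnitude(K)            # (n_k, d)
    pair = lq[:, None, :] + lk[None, :, :]   # log|q_i * k_j| per dimension
    # Dominant-term proxy for the dot product: the largest log-magnitude
    # term tends to decide whether a score survives softmax (assumption).
    importance = pair.max(axis=-1)           # (n_q, n_k)
    return np.argsort(-importance, axis=1)[:, :k_keep]

# Toy usage: keep 16 of 128 keys per query before exact attention.
rng = np.random.default_rng(0)
Q, K = rng.standard_normal((4, 64)), rng.standard_normal((128, 64))
kept = predict_important_keys(Q, K, k_keep=16)
print(kept.shape)  # (4, 16)
```

Because only comparisons and additions appear on the prediction path, hardware can implement it with LZC units and adders instead of multipliers, which is the kind of prediction-overhead reduction the abstract describes.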
Similar Papers
BitStopper: An Efficient Transformer Attention Accelerator via Stage-fusion and Early Termination
Machine Learning (CS)
Makes AI faster and uses less power.
Sparse Attention Remapping with Clustering for Efficient LLM Decoding on PIM
Computation and Language
Makes AI understand long stories faster.
STAR: Stage-Wise Attention-Guided Token Reduction for Efficient Large Vision-Language Models Inference
Machine Learning (CS)
Speeds up AI that understands pictures and words.