BitStopper: An Efficient Transformer Attention Accelerator via Stage-fusion and Early Termination
By: Huizheng Wang, Hongbin Wang, Shaojun Wei, and more
Potential Business Impact:
Makes AI faster while using less power.
Attention-based large language models (LLMs) have transformed modern AI applications, but the quadratic cost of self-attention imposes significant compute and memory overhead. Dynamic sparsity (DS) attention mitigates this, yet its hardware efficiency is limited by the added prediction stage and the heavy memory traffic it entails. To address these limitations, this paper proposes BitStopper, a fine-grained algorithm-architecture co-design that operates without a sparsity predictor. First, a bit-serial enable stage fusion (BESF) mechanism is proposed to reuse intermediate results and minimize memory access by progressively terminating trivial tokens, merging the prediction stage into the execution stage. Second, a lightweight and adaptive token selection (LATS) strategy is developed to work in concert with the bit-level sparsity speculation. Third, a bit-level asynchronous processing (BAP) strategy is employed to improve compute utilization during on-demand bit-grained memory fetching. Finally, a dedicated architecture is designed to translate the theoretical complexity reduction into practical performance improvement. Extensive evaluations demonstrate that, compared to state-of-the-art (SOTA) Transformer accelerators, BitStopper achieves 2.03x and 1.89x speedups over Sanger and SOFA, respectively, while delivering 2.4x and 2.1x improvements in energy efficiency.
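The core BESF idea, terminating "trivial" tokens partway through a bit-serial score computation so that no separate prediction pass is needed, can be sketched in a few lines. The Python below is a minimal illustration under stated assumptions, not the paper's implementation: the function name, the top-k `keep_ratio` pruning criterion, and the unsigned 8-bit key quantization are all hypothetical. Only the general pattern follows the abstract: keys are consumed one bit plane at a time from most to least significant, and any token whose optimistic score upper bound can no longer reach the keep threshold is dropped before its remaining bits are ever fetched.

```python
import numpy as np

def bitserial_scores_with_early_termination(q, K, num_bits=8, keep_ratio=0.25):
    """Illustrative sketch of bit-serial QK scoring with early termination.

    q: (d,) integer query vector (may be signed).
    K: (n, d) unsigned integer key matrix with values in [0, 2**num_bits).
    Returns the partial scores and a mask of surviving (non-trivial) tokens.
    """
    n, d = K.shape
    active = np.ones(n, dtype=bool)      # tokens still being computed
    partial = np.zeros(n)                # bit-serial partial dot products
    q_abs_sum = float(np.abs(q).sum())   # loose bound on one bit plane's reach

    for b in range(num_bits - 1, -1, -1):      # MSB -> LSB bit planes of K
        plane = (K >> b) & 1                   # b-th bit of every key element
        partial[active] += (plane[active] @ q) * (1 << b)

        # All remaining planes together add at most (2**b - 1) * sum|q|,
        # so partial + slack is a valid upper bound on each final score.
        slack = (2 ** b - 1) * q_abs_sum
        upper = partial + slack

        # Keep only tokens whose optimistic score still reaches the
        # current top-k cut (a hypothetical selection criterion).
        m = min(max(1, int(keep_ratio * n)), int(active.sum()))
        cut = np.partition(partial[active], -m)[-m]
        active &= (upper >= cut)               # terminated tokens stay dead

    return partial, active
```

The MSB-first ordering is what makes the fusion work in this sketch: high-order bits dominate each score, so most trivial tokens are ruled out after only a few bit planes, and the "prediction" of which tokens matter falls out of the same arithmetic that computes the surviving scores.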
Similar Papers
Low Power Vision Transformer Accelerator with Hardware-Aware Pruning and Optimized Dataflow
Hardware Architecture
Makes computer vision faster while using less power.
Bidirectional Sparse Attention for Faster Video Diffusion Training
CV and Pattern Recognition
Makes video creation faster and cheaper.
MCBP: A Memory-Compute Efficient LLM Inference Accelerator Leveraging Bit-Slice-enabled Sparsity and Repetitiveness
Hardware Architecture
Makes AI answer questions much faster while using less power.