
PADE: A Predictor-Free Sparse Attention Accelerator via Unified Execution and Stage Fusion

Published: December 16, 2025 | arXiv ID: 2512.14322v1

By: Huizheng Wang, Hongbin Wang, Zichuan Wang and more

Potential Business Impact:

Makes AI faster and use less power.

Business Areas:

Field-Programmable Gate Array (FPGA) Hardware

Attention-based models have revolutionized AI, but the quadratic cost of self-attention incurs severe computational and memory overhead. Sparse attention methods alleviate this by skipping low-relevance token pairs. However, current approaches are impractical because the added sparsity predictor is expensive and severely degrades hardware efficiency. This paper advances the state of the art (SOTA) by proposing a bit-serial enabled stage-fusion (BSF) mechanism, which eliminates the need for a separate predictor. BSF, however, faces three key challenges: 1) inaccurate bit-sliced sparsity speculation leads to incorrect pruning; 2) fine-grained, imbalanced bit-level workloads cause hardware under-utilization; 3) the row-wise dependency in the sparsity pruning criteria makes tiling difficult. We propose PADE, a predictor-free algorithm-hardware co-design for dynamic sparse attention acceleration. PADE features three key innovations: 1) a bit-wise uncertainty interval-enabled guard filtering (BUI-GF) strategy that accurately identifies trivial tokens during each bit round; 2) bidirectional sparsity-based out-of-order execution (BS-OOE) to improve hardware utilization; 3) interleaving-based sparsity-tiled attention (ISTA) to reduce both I/O and computational complexity. These techniques, combined with custom accelerator designs, enable practical sparsity acceleration without relying on an added sparsity predictor. Extensive experiments on 22 benchmarks show that PADE achieves a 7.43x speedup and 31.1x higher energy efficiency than an Nvidia H100 GPU. Compared to SOTA accelerators, PADE achieves 5.1x, 4.3x, and 3.4x energy savings over Sanger, DOTA, and SOFA, respectively.
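The abstract does not spell out how BUI-GF works, so the following is only a minimal sketch of the general idea of predictor-free, bit-serial pruning with uncertainty intervals, not the paper's algorithm. It assumes unsigned `bits`-wide keys, a signed integer query, and a hypothetical `margin` parameter that plays the role of the guard: keys are consumed one bit plane at a time from the most significant bit, each partial score is bracketed by a lower and an upper bound on what the unseen low bits could still add, and a key is dropped as soon as its upper bound falls below the best lower bound minus the margin. The function name and all parameters are illustrative.

```python
import numpy as np

def bit_serial_prune(q, K, bits=8, margin=0):
    """Illustrative bit-serial score computation with interval-based pruning.

    q: (d,) signed integer query. K: (n, d) unsigned integer keys < 2**bits.
    margin: non-negative guard (hypothetical parameter) in score units.
    Returns (alive mask, scores, number of key bit-planes processed).
    Scores are exact only for keys that are still alive at the end.
    """
    q = np.asarray(q, dtype=np.int64)
    K = np.asarray(K, dtype=np.int64)
    n = K.shape[0]
    pos = q[q > 0].sum()  # most the unseen bits can add, per unit of magnitude
    neg = q[q < 0].sum()  # most they can subtract (non-positive)
    partial = np.zeros(n, dtype=np.int64)
    alive = np.ones(n, dtype=bool)
    work = 0
    for b in range(bits - 1, -1, -1):        # most significant bit first
        plane = (K[alive] >> b) & 1          # current bit of each live key
        partial[alive] += (plane @ q) * (1 << b)
        work += int(alive.sum())
        rem = (1 << b) - 1                   # max value of the unseen low bits
        lo = partial + rem * neg             # lower bound on the final score
        hi = partial + rem * pos             # upper bound on the final score
        best_lo = lo[alive].max()
        alive &= hi >= best_lo - margin      # drop keys that cannot catch up
    return alive, partial, work
```

Because the bounds are conservative, this sketch never drops a key whose exact score is within `margin` of the maximum; the saving comes from skipping the remaining bit planes of keys that are pruned early. Score computation and sparsity detection share the same bit-serial pass, which is the sense in which no separate predictor is needed here.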

SPAD: Specialized Prefill and Decode Hardware for Disaggregated LLM Inference

Hardware Architecture

Makes AI chat faster and cheaper to run.

9 Oct 2025 1

87%

BitStopper: An Efficient Transformer Attention Accelerator via Stage-fusion and Early Termination

Machine Learning (CS)

Makes AI faster and use less power.

6 Dec 2025 1

86%

Progressive Sparse Attention: Algorithm and System Co-design for Efficient Attention in LLM Serving

Machine Learning (CS)

Makes AI understand long texts faster and cheaper.

1 Mar 2025 3


Country of Origin

🇨🇳 China

Page Count

18 pages
