A Digital SRAM-Based Compute-In-Memory Macro for Weight-Stationary Dynamic Matrix Multiplication in Transformer Attention Score Computation

Published: November 15, 2025 | arXiv ID: 2511.12152v3

By: Jianyi Yu, Tengxiao Wang, Yuxuan Wang, and more

Potential Business Impact:

Makes AI faster and more power-efficient.

Business Areas:
DSP Hardware

Compute-in-memory (CIM) techniques are widely employed in energy-efficient artificial intelligence (AI) processors because they alleviate the power and latency bottlenecks caused by extensive data movement between compute and storage units. To extend these benefits to Transformers, this brief proposes a digital CIM macro that computes attention scores. To eliminate dynamic matrix multiplication (MM), the computation is reconstructed as a static MM using a combined QK-weight matrix, so that inputs can be fed directly to a single CIM macro to obtain the score results. This reconstruction, however, introduces a new challenge: a static MM with two dynamic inputs. The computation is therefore decomposed into four groups of bit-serial logical and addition operations, which lets the two input operands activate the word lines directly through AND gates, realizing the two-input static MM with minimal overhead. A hierarchical zero-value bit-skipping mechanism further prioritizes skipping zero-valued bits from both inputs, exploiting their data sparsity to significantly reduce redundant operations. Implemented in a 65-nm process, the 0.35 mm² macro delivers 42.27 GOPS at 1.24 mW, yielding 34.1 TOPS/W energy efficiency and 120.77 GOPS/mm² area efficiency. Compared with CPUs and GPUs, it achieves ~25x and ~13x higher efficiency, respectively. Against other Transformer CIMs, it demonstrates at least 7x higher energy efficiency and 2x higher area efficiency, highlighting its strong potential for edge intelligence.
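
To make the reconstruction concrete, here is a minimal NumPy sketch of the two ideas in the abstract. All names and sizes (X, W_Q, W_K, W_QK, bitserial_score, 4-bit operands) are illustrative assumptions, not taken from the paper, and the second function only models AND-gated bit-serial accumulation with zero-plane skipping in software; it does not reproduce the paper's four-group hardware decomposition.

```python
# Sketch 1: the attention score Q @ K^T = (X @ W_Q) @ (X @ W_K)^T can be
# rewritten as X @ W_QK @ X^T with a precomputed combined weight
# W_QK = W_Q @ W_K^T, turning the dynamic MM into a weight-stationary static MM.
import numpy as np

rng = np.random.default_rng(0)
n, d = 4, 8                           # sequence length, model dim (illustrative)
X = rng.integers(0, 16, (n, d))       # 4-bit unsigned activations (assumed)
W_Q = rng.integers(-8, 8, (d, d))
W_K = rng.integers(-8, 8, (d, d))

score_dynamic = (X @ W_Q) @ (X @ W_K).T   # conventional two-step dynamic MM
W_QK = W_Q @ W_K.T                        # combined weight, stationary in the array
score_static = X @ W_QK @ X.T             # single static MM with two dynamic inputs
assert np.array_equal(score_dynamic, score_static)

# Sketch 2: with both inputs expanded into bit-planes (x = sum_k 2^k * x_k),
# each partial term is gated by a logical AND of one bit from each input, and
# any all-zero bit-plane is skipped entirely (a software stand-in for
# hierarchical zero-value bit skipping).
def bitserial_score(xi, xj, W, bits=4):
    """Return xi @ W @ xj for unsigned bit-serial inputs, skipping zero planes."""
    acc = 0
    for k in range(bits):
        xi_k = (xi >> k) & 1              # bit-plane k of the first input
        if not xi_k.any():                # whole plane is zero: skip it
            continue
        for l in range(bits):
            xj_l = (xj >> l) & 1
            if not xj_l.any():
                continue
            gate = np.outer(xi_k, xj_l)   # AND of the two input bits per (a, b)
            acc += int((gate * W).sum()) << (k + l)
    return acc

assert bitserial_score(X[0], X[1], W_QK) == X[0] @ W_QK @ X[1]
```

The sparser the input bit-planes, the more of the inner work the skip conditions avoid, which is the intuition behind the reported reduction in redundant operations.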

Country of Origin
🇨🇳 China

Page Count
5 pages

Category
Computer Science:
Hardware Architecture