Score: 0

A digital SRAM-based compute-in-memory macro for weight-stationary dynamic matrix multiplication in Transformer attention score computation

Published: November 15, 2025 | arXiv ID: 2511.12152v2

By: Jianyi Yu , Tengxiao Wang , Yuxuan Wang and more

Potential Business Impact:

Makes AI faster and use less power.

Business Areas:

Quantum Computing Science and Engineering

Compute-in-memory (CIM) techniques are widely employed in energy-efficient artificial intelligent (AI) processors. They alleviate power and latency bottlenecks caused by extensive data movements between compute and storage units. This work proposes a digital CIM macro to compute Transformer attention. To mitigate dynamic matrix multiplication that is unsuitable for the common weight-stationary CIM paradigm, we reformulate the attention score computation process based on a combined QK-weight matrix, so that inputs can be directly fed to CIM cells to obtain the score results. Moreover, the involved binomial matrix multiplication operation is decomposed into 4 groups of bit-serial shifting and additions, without costly physical multipliers in the CIM. We maximize the energy efficiency of the CIM circuit through zero-value bit-skipping, data-driven word line activation, read-write separate 6T cells and bit-alternating 14T/28T adders. The proposed CIM macro was implemented using a 65-nm process. It occupied only 0.35 mm2 area, and delivered a 42.27 GOPS peak performance with 1.24 mW power consumption at a 1.0 V power supply and a 100 MHz clock frequency, resulting in 34.1 TOPS/W energy efficiency and 120.77 GOPS/mm2 area efficiency. When compared to the CPU and GPU, our CIM macro is 25x and 13x more energy efficient on practical tasks, respectively. Compared with other Transformer-CIMs, our design exhibits at least 7x energy efficiency and at least 2x area efficiency improvements when scaled to the same technology node, showcasing its potential for edge-side intelligent applications.

A digital SRAM-based compute-in-memory macro for weight-stationary dynamic matrix multiplication in Transformer attention score computation

Hardware Architecture

Makes AI faster and use less power.

15 Nov 2025 0

99%

A Digital SRAM-Based Compute-In-Memory Macro for Weight-Stationary Dynamic Matrix Multiplication in Transformer Attention Score Computation

Hardware Architecture

Makes AI faster and use less power.

15 Nov 2025 0

91%

Hardware-Software Co-Design for Accelerating Transformer Inference Leveraging Compute-in-Memory

Hardware Architecture

Makes AI learn faster and use less power.

17 Feb 2025 0

View PDF Login to Bookmark

Country of Origin

🇨🇳 China

Page Count

6 pages

A digital SRAM-based compute-in-memory macro for weight-stationary dynamic matrix multiplication in Transformer attention score computation

Makes AI faster and use less power.

Technical Abstract

A digital SRAM-based compute-in-memory macro for weight-stationary dynamic matrix multiplication in Transformer attention score computation

A Digital SRAM-Based Compute-In-Memory Macro for Weight-Stationary Dynamic Matrix Multiplication in Transformer Attention Score Computation

Hardware-Software Co-Design for Accelerating Transformer Inference Leveraging Compute-in-Memory