Leveraging Compute-in-Memory for Efficient Generative Model Inference in TPUs
By: Zhantong Zhu, Hongou Li, Wenjie Ren, and more
Potential Business Impact:
Enables generative AI models to run faster and consume less power on TPU hardware.
With the rapid advent of generative models, efficiently deploying them on specialized hardware has become critical. Tensor Processing Units (TPUs) are designed to accelerate AI workloads, but their high power consumption necessitates innovations that improve efficiency. Compute-in-memory (CIM) has emerged as a promising paradigm with superior area and energy efficiency. In this work, we present a TPU architecture that integrates digital CIM to replace the conventional digital systolic arrays in the matrix multiply units (MXUs). We first establish a CIM-based TPU architecture model and simulator to evaluate the benefits of CIM for diverse generative model inference workloads. Building on the observed design insights, we then explore various CIM-based TPU architectural design choices. Compared to the baseline TPUv4i architecture, different design choices achieve up to 44.2% and 33.8% performance improvements for large language model and diffusion transformer inference, respectively, and up to a 27.3x reduction in MXU energy consumption.
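To make the evaluation flow concrete, the sketch below shows how a first-order architecture model might compare a systolic-array MXU against a digital-CIM MXU on a single matrix multiply. This is a minimal illustration, not the paper's simulator: the array dimensions, clock frequencies, and per-MAC energy figures are illustrative assumptions, and memory traffic, pipeline fill, and dataflow details are ignored.

```python
# Minimal first-order sketch (not the authors' simulator): estimate MXU
# latency and energy for a matrix multiply on (a) a digital systolic array
# and (b) a digital CIM macro array. All per-MAC energies, frequencies, and
# array sizes below are illustrative assumptions, not values from the paper.
from dataclasses import dataclass


@dataclass
class MXUConfig:
    name: str
    rows: int                 # PE rows / CIM macro rows
    cols: int                 # PE columns / CIM macro columns
    freq_ghz: float           # operating frequency
    energy_per_mac_pj: float  # assumed average energy per 8-bit MAC


def matmul_cost(cfg: MXUConfig, m: int, k: int, n: int):
    """Tile an (m x k) @ (k x n) matmul onto the MXU and return
    (latency_us, energy_uj). Ignores memory traffic and pipeline fill."""
    macs = m * k * n
    # Number of output tiles when the array covers a (rows x cols) block.
    tiles = -(-m // cfg.rows) * -(-n // cfg.cols)
    cycles = tiles * k                        # one reduction step per cycle
    latency_us = cycles / (cfg.freq_ghz * 1e3)
    energy_uj = macs * cfg.energy_per_mac_pj * 1e-6  # pJ -> uJ
    return latency_us, energy_uj


if __name__ == "__main__":
    # Hypothetical configurations for a quick side-by-side comparison.
    systolic = MXUConfig("systolic-array MXU", 128, 128, 1.0, 1.0)
    cim = MXUConfig("digital-CIM MXU", 128, 128, 0.8, 0.05)
    # A transformer FFN-like GEMM shape (batch*seq, d_model) @ (d_model, 4*d_model).
    for cfg in (systolic, cim):
        lat, en = matmul_cost(cfg, m=2048, k=4096, n=16384)
        print(f"{cfg.name:22s} latency ~{lat:9.1f} us, MXU energy ~{en:10.1f} uJ")
```

Sweeping such a model over the GEMM shapes of a target workload (e.g., LLM decode versus diffusion transformer layers) is one way to expose the kind of design trade-offs the paper explores before committing to a particular CIM-based MXU configuration.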